Friday, 15 March 2019

Easiest Way to Benchmark Spark + Alluxio + S3 Stack With TPC-DS Queries on AWS

The Alluxio sandbox is the easiest way to test-drive the popular data analytics stack of Spark, Alluxio, and S3, deployed as a multi-node cluster in a public cloud environment. The sandbox cluster is fully configured and ready for users to run applications ranging from the hello-world example to the TPC-DS benchmark suite. Don’t take our word for it; kick off the benchmark yourself to compare the performance of Spark jobs that access S3 through Alluxio with Spark jobs that run directly on S3. You can request and launch a sandbox cluster as a playground for 24 hours at no cost to you.

Cluster Details

The sandbox cluster consists of 2 master nodes and 4 worker nodes, all using r4.2xlarge EC2 instances. Alluxio, currently at version 1.8.1, is configured to use an S3 bucket as its root under storage. It is deployed in high availability mode with a leading master and a backup master. To run TPC-DS, Apache Spark is deployed with its master on the first master node and a worker on each of the worker nodes. Note that the Spark workers are co-located with the Alluxio workers so that Spark jobs can take advantage of data locality when the data they read is held in memory by the Alluxio worker on the same node.
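
To make the comparison concrete, here is a minimal Python sketch of timing the same read once directly against S3 and once through Alluxio. It is an illustration under stated assumptions, not the sandbox's own benchmark script: the bucket name, the host name alluxio-master, the dataset path, and the Parquet format are all hypothetical placeholders, and the helper names are invented for this example. Port 19998 is the default Alluxio master RPC port.

```python
import time
from typing import Callable, Dict

# Hypothetical locations: the bucket, host and dataset path are
# placeholders, not values taken from the sandbox cluster.
S3_PATH = "s3a://example-bucket/tpcds/store_sales"
ALLUXIO_PATH = "alluxio://alluxio-master:19998/tpcds/store_sales"


def time_action(action: Callable[[str], int], path: str) -> Dict[str, object]:
    """Run action(path) once and return its result with the elapsed seconds."""
    start = time.perf_counter()
    rows = action(path)
    elapsed = time.perf_counter() - start
    return {"path": path, "rows": rows, "seconds": elapsed}


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate ran than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


def count_parquet_rows(path: str) -> int:
    """Count rows with Spark.

    Assumes pyspark is installed, the data is stored as Parquet, and the
    Alluxio client jar is on the Spark classpath so that alluxio:// paths
    resolve.
    """
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = SparkSession.builder.appName("s3-vs-alluxio").getOrCreate()
    return spark.read.parquet(path).count()
```

On a cluster where those assumptions hold, calling time_action(count_parquet_rows, S3_PATH) and time_action(count_parquet_rows, ALLUXIO_PATH) and passing the two "seconds" values to speedup gives a rough ratio. The result depends heavily on whether the data is already cached in Alluxio, so a single run should be read as a smoke test rather than a benchmark.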



from DZone.com Feed https://ift.tt/2Jcmm16
