{"templateId":"markdown","sharedDataIds":{},"props":{"metadata":{"markdoc":{"tagList":[]},"redocly_category":"Integrations","type":"markdown"},"seo":{"title":"Amazon Elastic MapReduce","description":"Treasure Data Product Documentation · Collect and Unify · Segment and Activate · Experiment and Analyze · Decisioning Automate with AI Scale and Trust.","siteUrl":"https://docs.treasuredata.com","lang":"en-US","llmstxt":{"hide":false,"sections":[{"title":"Table of contents","includeFiles":["**/*"],"excludeFiles":[]}],"excludeFiles":[]}},"dynamicMarkdocComponents":[],"compilationErrors":[],"ast":{"$$mdtype":"Tag","name":"article","attributes":{},"children":[{"$$mdtype":"Tag","name":"Heading","attributes":{"level":1,"id":"amazon-elastic-mapreduce","__idx":0},"children":["Amazon Elastic MapReduce"]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["You can use the Apache Spark Driver for Treasure Data (also known as td-spark) on Amazon Elastic MapReduce (EMR). Although we recommend using the ",{"$$mdtype":"Tag","name":"strong","attributes":{},"children":["us-east"]}," region of Amazon EC2 for the optimal performance, you can use td-spark in other Spark environments as well."]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"overview","__idx":1},"children":["Overview"]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["Refer to ",{"$$mdtype":"Tag","name":"MarkdownLink","attributes":{"href":"/products/customer-data-platform/data-workbench/queries/trino/query_faqs"},"children":["TD Spark FAQs"]}," for an overview of td-spark."]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"what-does-td-spark-enable","__idx":2},"children":["What Does td-spark Enable?"]},{"$$mdtype":"Tag","name":"ul","attributes":{},"children":[{"$$mdtype":"Tag","name":"li","attributes":{},"children":["Access Treasure Data from Spark in Scala and Python (PySpark)."]},{"$$mdtype":"Tag","name":"li","attributes":{},"children":["Pull Treasure Data tables into 
Spark as a DataFrame (no TD query is issued, which gives the lowest-latency path between data stored in TD and your Spark cluster)."]},{"$$mdtype":"Tag","name":"li","attributes":{},"children":["Issue Presto, Hive, or SparkSQL queries and return the result as a Spark DataFrame."]}]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["This driver is recommended for use with Spark 2.4.4 or higher."]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"recommendations-regarding-use","__idx":3},"children":["Recommendations Regarding Use"]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["For the fastest data access and lowest data transfer costs, we recommend that you set up your Spark cluster in the AWS us-east region. Data transfer costs can be high if you use other AWS regions or processing environments."]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"td-spark-driver-on-emr","__idx":4},"children":["TD Spark Driver on EMR"]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":3,"id":"create-an-emr-spark-cluster","__idx":5},"children":["Create an EMR Spark Cluster"]},{"$$mdtype":"Tag","name":"ol","attributes":{},"children":[{"$$mdtype":"Tag","name":"li","attributes":{},"children":[{"$$mdtype":"Tag","name":"p","attributes":{},"children":["Create an EMR cluster with Spark support."]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["The ",{"$$mdtype":"Tag","name":"strong","attributes":{},"children":["us-east"]}," region is recommended to maximize data transfer performance from S3."]}]},{"$$mdtype":"Tag","name":"li","attributes":{},"children":[{"$$mdtype":"Tag","name":"p","attributes":{},"children":["Check the master node address of the new EMR cluster."]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["Read table data as Spark 
DataFrame"]}]}]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"img","attributes":{"src":"/assets/image2020-11-18_9-9-26.c6538e30038a6b2dd513a89175ca7d459cba989c4e8b5491b087998ef2f4e30e.d33fa2c4.png","alt":""},"children":[]}]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"img","attributes":{"src":"/assets/image2020-11-18_9-11-11.ce52ded660f570f751fb8d590bf08c5bc1575a6656533e0c4587e5a23ce1ec39.d33fa2c4.png","alt":""},"children":[]}]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["If you created the EMR cluster with the default security group (ElasticMapReduce-master), you need to permit inbound access from your environment. See ",{"$$mdtype":"Tag","name":"MarkdownLink","attributes":{"href":"http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-man-sec-groups.html"},"children":["Amazon EMR-Managed Security Groups"]},"."]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":3,"id":"reference","__idx":6},"children":["Reference"]},{"$$mdtype":"Tag","name":"ul","attributes":{},"children":[{"$$mdtype":"Tag","name":"li","attributes":{},"children":[{"$$mdtype":"Tag","name":"MarkdownLink","attributes":{"href":"http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-launch.html"},"children":["Create An EMR Cluster with Spark"]}]}]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"log-in-to-the-emr-cluster","__idx":7},"children":["Log-in to the EMR Cluster"]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"MarkdownLink","attributes":{"href":"http://docs.aws.amazon.com//ElasticMapReduce/latest/ManagementGuide/emr-connect-master-node-ssh.html"},"children":["Connect to EMR Master node with SSH"]}]},{"$$mdtype":"Tag","name":"CodeBlock","attributes":{"data-language":"bash","header":{"controls":{"copy":{}}},"source":"# Use 8157 for SOCKS5 proxy port so that you can access EMR Spark job history page (port 18080), Zeppelin note 
book (port 8890), etc.\n$ ssh -i (your AWS key pair file. .pem) -D8157 hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com\n     __|  __|_  )\n       _|  (     /   Amazon Linux AMI\n      ___|\\___|___|\nhttps://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/\n4 package(s) needed for security, out of 6 available\nRun \"sudo yum update\" to apply all updates.\n\nEEEEEEEEEEEEEEEEEEEE MMMMMMMM           MMMMMMMM RRRRRRRRRRRRRRR\nE::::::::::::::::::E M:::::::M         M:::::::M R::::::::::::::R\nEE:::::EEEEEEEEE:::E M::::::::M       M::::::::M R:::::RRRRRR:::::R\n  E::::E       EEEEE M:::::::::M     M:::::::::M RR::::R      R::::R\n  E::::E             M::::::M:::M   M:::M::::::M   R:::R      R::::R\n  E:::::EEEEEEEEEE   M:::::M M:::M M:::M M:::::M   R:::RRRRRR:::::R\n  E::::::::::::::E   M:::::M  M:::M:::M  M:::::M   R:::::::::::RR\n  E:::::EEEEEEEEEE   M:::::M   M:::::M   M:::::M   R:::RRRRRR::::R\n  E::::E             M:::::M    M:::M    M:::::M   R:::R      R::::R\n  E::::E       EEEEE M:::::M     MMM     M:::::M   R:::R      R::::R\nEE:::::EEEEEEEE::::E M:::::M             M:::::M   R:::R      R::::R\nE::::::::::::::::::E M:::::M             M:::::M RR::::R      R::::R\nEEEEEEEEEEEEEEEEEEEE MMMMMMM             MMMMMMM RRRRRRR      RRRRRR\n","lang":"bash"},"children":[]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"set-up-td-spark-integration","__idx":8},"children":["Set Up TD Spark Integration"]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["Download td-spark jar file:"]},{"$$mdtype":"Tag","name":"CodeBlock","attributes":{"data-language":"bash","header":{"controls":{"copy":{}}},"source":"[hadoop@ip-x-x-x-x]$ wget https://s3.amazonaws.com/td-spark/td-spark-assembly_2.11-0.4.0.jar\n","lang":"bash"},"children":[]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["Create a ",{"$$mdtype":"Tag","name":"strong","attributes":{},"children":["td.conf"]}," file in the master 
node:"]},{"$$mdtype":"Tag","name":"CodeBlock","attributes":{"data-language":"bash","header":{"controls":{"copy":{}}},"source":"# Describe your TD API key here\nspark.td.apikey (your TD API key)\nspark.td.site (your site name: us, jp, ap02, eu01, etc.)\n# (Recommended) Use KryoSerializer for faster performance\nspark.serializer org.apache.spark.serializer.KryoSerializer\n","lang":"bash"},"children":[]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"using-spark-shell-on-emr","__idx":9},"children":["Using spark-shell on EMR"]},{"$$mdtype":"Tag","name":"CodeBlock","attributes":{"data-language":"bash","header":{"controls":{"copy":{}}},"source":"[hadoop@ip-x-x-x-x]$ spark-shell --master yarn --jars td-spark-assembly_2.11-0.4.0.jar --properties-file td.conf\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 2.1.0\n      /_/\nscala> import com.treasuredata.spark._\nscala> val td = spark.td\nscala> val d = td.table(\"sample_datasets.www_access\").df\nscala> d.show\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\n|user|           host|                path|             referer|code|               agent|size|method|      time|\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\n|null|136.162.131.221|    /category/health|   /category/cameras| 200|Mozilla/5.0 (Wind...|  77|   GET|1412373596|\n|null| 172.33.129.134|      /category/toys|   /item/office/4216| 200|Mozilla/5.0 (comp...| 115|   GET|1412373585|\n|null| 220.192.77.135|  /category/software|                   -| 200|Mozilla/5.0 (comp...| 116|   GET|1412373574|\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\nonly showing top 3 
rows\n","lang":"bash"},"children":[]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"using-zeppelin-notebook-on-emr","__idx":10},"children":["Using Zeppelin Notebook on EMR"]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":3,"id":"configure-zeppelin-for-td-spark","__idx":11},"children":["Configure Zeppelin for td-spark"]},{"$$mdtype":"Tag","name":"ol","attributes":{},"children":[{"$$mdtype":"Tag","name":"li","attributes":{},"children":["Create an SSH tunnel to the EMR cluster."]}]},{"$$mdtype":"Tag","name":"CodeBlock","attributes":{"data-language":"bash","header":{"controls":{"copy":{}}},"source":"$ ssh -i (your AWS key pair file. .pem) -D8157 hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com\n","lang":"bash"},"children":[]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["(For Chrome users) Install the Proxy Switchy Sharp Chrome extension. Turn on proxy-switch for EMR when accessing your EMR master node."," ","2. Open ",{"$$mdtype":"Tag","name":"code","attributes":{},"children":["http://(your EMR master node public address):8890/"]}," ","3. Open the ",{"$$mdtype":"Tag","name":"strong","attributes":{},"children":["Interpreters"]}," page to configure td-spark."]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"img","attributes":{"src":"/assets/image2020-11-18_9-15-28.a4f96f2069e1df24d7c95be0d35e0d01c8821747176e796e484e3c15a686c98f.d33fa2c4.png","alt":""},"children":[]}," ","4. 
Edit your profile details and select ",{"$$mdtype":"Tag","name":"strong","attributes":{},"children":["Save"]},"."]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"img","attributes":{"src":"/assets/image2020-11-18_9-16-54.cb15bd5327e74c96e696f3d16f94c0eca3c92fc777e25e082d78cf668caca70b.d33fa2c4.png","alt":""},"children":[]}]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":3,"id":"access-dataset-in-td-as-dataframe","__idx":12},"children":["Access Dataset in TD as DataFrame"]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["You can read table data as a Spark DataFrame."]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"img","attributes":{"src":"/assets/image2020-11-18_9-17-56.82c18313f4714d4d7a00e721ac7cd60cd21bf1ec4ab882db2e1be59b85254977.d33fa2c4.png","alt":""},"children":[]}]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":3,"id":"running-presto-queries","__idx":13},"children":["Running Presto Queries"]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"img","attributes":{"src":"/assets/image2020-11-18_9-18-48.0e69e43eeae788cd1b81f341a8ac3874bfcd86cb55306d80b9166ad729cf9b5a.d33fa2c4.png","alt":""},"children":[]}]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":3,"id":"checking-spark-history-server","__idx":14},"children":["Checking Spark History Server"]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":["You can check your event history in the ",{"$$mdtype":"Tag","name":"strong","attributes":{},"children":["History Server"]}," using your EMR master node public address."]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"code","attributes":{},"children":["http://(your EMR master node public 
address):18080/"]}]},{"$$mdtype":"Tag","name":"p","attributes":{},"children":[{"$$mdtype":"Tag","name":"img","attributes":{"src":"/assets/image2020-11-18_9-21-52.42a636aa3768897c576b4224cfb24be36c330e2135b7c1f7a7b9c5558e55b395.d33fa2c4.png","alt":""},"children":[]}]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":2,"id":"td-spark-driver-use-with-pyspark-and-sparksql","__idx":15},"children":["TD Spark Driver Use with PySpark and SparkSQL"]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":3,"id":"pyspark","__idx":16},"children":["PySpark"]},{"$$mdtype":"Tag","name":"CodeBlock","attributes":{"data-language":"bash","header":{"controls":{"copy":{}}},"source":"$ ./bin/pyspark  --driver-class-path ~/work/git/td-spark/td-spark/target/td-spark-assembly-0.1-SNAPSHOT.jar --properties-file ../td-dev.conf\n\n>>> df = spark.read.format(\"com.treasuredata.spark\").load(\"sample_datasets.www_access\")\n>>> df.show(10)\n2016-07-19 16:34:15-0700  info [TDRelation] Fetching www_access within time range:[-9223372036854775808,9223372036854775807) - (TDRelation.scala:82)\n2016-07-19 16:34:16-0700  info [TDRelation] Retrieved 19 PlazmaAPI entries - (TDRelation.scala:85)\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\n|user|           host|                path|             referer|code|               agent|size|method|      time|\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\n|null| 148.165.90.106|/category/electro...|     /category/music| 200|Mozilla/4.0 (comp...|  66|   GET|1412333993|\n|null|  144.105.77.66|/item/electronics...| /category/computers| 200|Mozilla/5.0 (iPad...| 135|   GET|1412333977|\n|null| 108.54.178.116|/category/electro...|  /category/software| 200|Mozilla/5.0 (Wind...|  69|   GET|1412333961|\n|null|104.129.105.202|/item/electronics...|     /item/games/394| 200|Mozilla/5.0 (comp...|  83|   GET|1412333945|\n|null|   
208.48.26.63|  /item/software/706|/search/?c=Softwa...| 200|Mozilla/5.0 (comp...|  76|   GET|1412333930|\n|null|  108.78.209.95|/item/giftcards/4879|      /item/toys/197| 200|Mozilla/5.0 (Wind...| 137|   GET|1412333914|\n|null| 108.198.97.206|/item/computers/4785|                   -| 200|Mozilla/5.0 (Wind...|  69|   GET|1412333898|\n|null| 172.195.185.46|     /category/games|     /category/games| 200|Mozilla/5.0 (Maci...|  41|   GET|1412333882|\n|null|   88.24.72.177|/item/giftcards/4410|                   -| 200|Mozilla/4.0 (comp...|  72|   GET|1412333866|\n|null|  24.129.141.79|/category/electro...|/category/networking| 200|Mozilla/5.0 (comp...|  73|   GET|1412333850|\n+----+---------------+--------------------+--------------------+----+--------------------+----+------+----------+\nonly showing top 10 rows\n\n## Submitting presto job\n>>> df = spark.read.format(\"com.treasuredata.spark\").options(sql=\"select 1\").load(\"sample_datasets\")\n2016-07-19 16:56:56-0700  info [TDSparkContext]\nSubmitted job 515990:\nselect 1 - (TDSparkContext.scala:70)\n>>> df.show(10)\n+-----+\n|_col0|\n+-----+\n|    1|\n+-----+\n\n## Reading job results\n>>> df = sqlContext.read.format(\"com.treasuredata.spark\").load(\"job_id:515990\")\n","lang":"bash"},"children":[]},{"$$mdtype":"Tag","name":"Heading","attributes":{"level":3,"id":"sparksql","__idx":17},"children":["SparkSQL"]},{"$$mdtype":"Tag","name":"CodeBlock","attributes":{"data-language":"scala","header":{"controls":{"copy":{}}},"source":"# Register DataFrame as a temporary table\nscala> td.df(\"hb_tiny.rankings\").createOrReplaceTempView(\"rankings\")\n\nscala> val q1 = spark.sql(\"select page_url, page_rank from rankings where page_rank > 100\")\nq1: org.apache.spark.sql.DataFrame = [page_url: string, page_rank: bigint]\n\nscala> q1.show\n2016-07-20 11:27:11-0700  info [TDRelation] Fetching rankings within time range:[-9223372036854775808,9223372036854775807) - (TDRelation.scala:82)\n2016-07-20 11:27:12-0700  info 
[TDRelation] Retrieved 2 PlazmaAPI entries - (TDRelation.scala:85)\n+--------------------+---------+\n|            page_url|page_rank|\n+--------------------+---------+\n|xjhmjsuqolfklbvxn...|      251|\n|seozvzwkcfgnfuzfd...|      165|\n|fdgvmwbrjlmvuoquy...|      132|\n|gqghyyardomubrfsv...|      108|\n|qtqntqkvqioouwfuj...|      278|\n|wrwgqnhxviqnaacnc...|      135|\n|  cxdmunpixtrqnvglnt|      146|\n| ixgiosdefdnhrzqomnf|      126|\n|xybwfjcuhauxiopfi...|      112|\n|ecfuzdmqkvqktydvi...|      237|\n|dagtwwybivyiuxmkh...|      177|\n|emucailxlqlqazqru...|      134|\n|nzaxnvjaqxapdjnzb...|      119|\n|       ffygkvsklpmup|      332|\n|hnapejzsgqrzxdswz...|      171|\n|rvbyrwhzgfqvzqkus...|      148|\n|knwlhzmcyolhaccqr...|      104|\n|nbizrgdziebsaecse...|      665|\n|jakofwkgdcxmaaqph...|      187|\n|kvhuvcjzcudugtidf...|      120|\n+--------------------+---------+\nonly showing top 20 rows\n","lang":"scala"},"children":[]}]},"headings":[{"value":"Amazon Elastic MapReduce","id":"amazon-elastic-mapreduce","depth":1},{"value":"Overview","id":"overview","depth":2},{"value":"What Does td-spark Enable?","id":"what-does-td-spark-enable","depth":2},{"value":"Recommendations Regarding Use","id":"recommendations-regarding-use","depth":2},{"value":"TD Spark Driver on EMR","id":"td-spark-driver-on-emr","depth":2},{"value":"Create an EMR Spark Cluster","id":"create-an-emr-spark-cluster","depth":3},{"value":"Reference","id":"reference","depth":3},{"value":"Log-in to the EMR Cluster","id":"log-in-to-the-emr-cluster","depth":2},{"value":"Set Up TD Spark Integration","id":"set-up-td-spark-integration","depth":2},{"value":"Using spark-shell on EMR","id":"using-spark-shell-on-emr","depth":2},{"value":"Using Zeppelin Notebook on EMR","id":"using-zeppelin-notebook-on-emr","depth":2},{"value":"Configure Zeppelin for td-spark","id":"configure-zeppelin-for-td-spark","depth":3},{"value":"Access Dataset in TD as 
DataFrame","id":"access-dataset-in-td-as-dataframe","depth":3},{"value":"Running Presto Queries","id":"running-presto-queries","depth":3},{"value":"Checking Spark History Server","id":"checking-spark-history-server","depth":3},{"value":"TD Spark Driver Use with PySpark and SparkSQL","id":"td-spark-driver-use-with-pyspark-and-sparksql","depth":2},{"value":"PySpark","id":"pyspark","depth":3},{"value":"SparkSQL","id":"sparksql","depth":3}],"frontmatter":{"seo":{"title":"Amazon Elastic MapReduce"}},"lastModified":"2026-01-27T10:05:25.000Z","pagePropGetterError":{"message":"","name":""}},"slug":"/int/amazon-elastic-mapreduce","userData":{"isAuthenticated":false,"teams":["anonymous"]},"isPublic":true}