Tuesday, April 14, 2015

SQOOP common flow examples




Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data, in both directions, between Apache Hadoop and structured datastores such as relational databases.

Sqoop metastore

        Running sqoop-metastore launches a shared HSQLDB database instance on the current machine. Clients can connect to this metastore and create jobs that can be shared between users for execution. The location of the metastore's files on disk is controlled by the sqoop.metastore.server.location property in conf/sqoop-site.xml.
Argument                     Description
--create <job-id>            Define a new saved job with the specified job-id (job name).
                             A second Sqoop command line, separated by --, should be
                             specified; this defines the saved job.
--delete <job-id>            Delete a saved job.
--exec <job-id>              Given a job defined with --create, run the saved job.
--show <job-id>              Show the parameters for a saved job.
--list                       List all saved jobs.
--meta-connect <jdbc-uri>    Specify the JDBC connect string used to connect to the metastore.


Example

This is an example of executing a saved job:

$ sqoop job --meta-connect <metastore-url> --exec <jobName>
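The other saved-job operations from the table above follow the same pattern. A hedged sketch (the metastore URL and the job name imp-ora-hr match the examples later in this post; substitute your own):

```shell
# List every job stored in the shared metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop --list

# Inspect the saved parameters of one job before running it
sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop --show imp-ora-hr

# Remove a saved job that is no longer needed
sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop --delete imp-ora-hr
```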
 

Import all tables for specific schema

The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.

sqoop job --create imp-ora-hr --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop -- import-all-tables --connect jdbc:oracle:thin:@<server>/XE  --username <username> --password <password> --hive-import -m 1 --verbose
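Once the imp-ora-hr job has been created in the metastore, it can be re-run at any time with --exec (same metastore URL as in the --create command above):

```shell
# Execute the saved import-all-tables job defined above
sqoop job --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop --exec imp-ora-hr
```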


Argument                     Description
--connect <jdbc-uri>         Specify the JDBC connect string.
--username <username>        Set the authentication username.
--password <password>        Set the authentication password.
--hive-import                Import tables into Hive (uses Hive's default delimiters if none are set).
--verbose                    Print more information while working.
--num-mappers <n> or -m <n>  Control the number of map tasks, i.e. the degree of parallelism used.



Import a full table


sqoop job --create imp-ora-hr-jobs --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop -- import --connect jdbc:oracle:thin:@10.33.6.58:1521/XE --table HR.JOBS --username SYSTEM --password SYS --hive-import --hive-table HR.JOBS -m 1 --verbose

Import table with incompatible columns


sqoop job --create imp-ora-DBSNMP-BASELINES --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop -- import --connect jdbc:oracle:thin:@10.10.10.10:1521/XE --table DBSNMP.MGMT_BSLN_BASELINES --columns BSLN_GUID,TARGET_UID,NAME,TYPE,STATUS --username SYSTEM --password SYS --hive-import --hive-table DBSNMP.BASELINES -m 1 --map-column-hive TARGET_UID=String,BSLN_GUID=String
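Here --map-column-hive overrides Sqoop's default type mapping for columns Hive cannot represent directly (the GUID columns are forced to String). One way to confirm the mapping after the import, assuming a local Hive CLI is available:

```shell
# Inspect the resulting Hive table schema; BSLN_GUID and TARGET_UID
# should now appear as string columns
hive -e "DESCRIBE DBSNMP.BASELINES"
```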
   

Import by query


sqoop job --create imp-ora-query --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop -- import --connect jdbc:oracle:thin:@10.33.6.58:1521/XE --query "SELECT E.EMAIL, D.DEPARTMENT_NAME, L.CITY, L.STREET_ADDRESS FROM HR.DEPARTMENTS D, HR.LOCATIONS L, HR.EMPLOYEES E WHERE D.LOCATION_ID = L.LOCATION_ID and D.MANAGER_ID = E.MANAGER_ID AND \$CONDITIONS" --username SYSTEM --password SYS --hive-import --hive-table MANAGERS -m 1 --target-dir /bigdata/managers
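The literal \$CONDITIONS token is mandatory in a free-form query: Sqoop substitutes a range predicate into it for each map task. With -m 1 there is only one task, but to use more mappers you must also tell Sqoop which column to split on. A hedged sketch (connection details, table, and column names are illustrative):

```shell
# Parallel free-form query import: --split-by names the column Sqoop
# uses to partition the $CONDITIONS ranges across the 4 mappers
sqoop import \
  --connect jdbc:oracle:thin:@<server>:1521/XE \
  --username SYSTEM --password <password> \
  --query "SELECT EMPLOYEE_ID, EMAIL FROM HR.EMPLOYEES WHERE \$CONDITIONS" \
  --split-by EMPLOYEE_ID \
  --target-dir /bigdata/employees \
  -m 4
```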



Export with procedure


sqoop job --create export-by-procedure --meta-connect jdbc:hsqldb:hsql://localhost:16000/sqoop -- export --connect jdbc:oracle:thin:@10.33.6.58:1521/XE --username GOSALESDW --password GOSALESDWPW --call FILL_BURST_TABLE --export-dir /apps/hive/warehouse/burst_table --input-fields-terminated-by '\001' --input-lines-terminated-by '\012' --input-null-string '\\N' --input-null-non-string '\\N'


Procedure:

create or replace PROCEDURE FILL_BURST_TABLE
(
  RECIPIENTS_IN   IN VARCHAR2,
  COUNTRY_CODE_IN IN NUMBER
) AS
BEGIN
  INSERT INTO BURST_TABLE (COUNTRY_CODE, RECIPIENTS)
  VALUES (COUNTRY_CODE_IN, RECIPIENTS_IN);
END FILL_BURST_TABLE;

