Automate running the same Datameer Workbook on different data sets


Datameer is an extremely user-friendly software platform for self-service end-to-end big data discovery and analytics. You can do just about anything in the realm of descriptive analytics using the Datameer GUI. You can automate a complex series of steps just by pointing and clicking with software that is as easy to use as Microsoft Office.

But there are some activities that require using the Datameer REST API. A prospect recently asked me “How can I run the same Datameer workbook on 50 different data sets that share the same column layout without doing the same thing 50 times manually in the GUI?” The answer is a script that invokes the Datameer REST API. The beauty of a REST API is that you can use the scripting language of your choice to tackle the problem. I ended up creating sample scripts in both Bash and Python to illustrate that you can pick your language.

The first order of business is defining the problem. In this case, we needed a mechanism to iterate over a set of data sources, all of which share the same column layout. During each iteration, the script must first create a new data source, then update the Datameer Workbook to point at that new data source, and finally execute the newly updated workbook. I also chose to implement a check that the workbook job had completed before starting the next cycle, rather than moving on without waiting. The data source I used in Datameer is called a Data Link. Data Links extract a sample data set from the source without duplicating the entire source set. When you want not just a sample but the entire data set available in Datameer’s HDFS folder, use an Import Job instead.

Pseudo code for my example looks like this:

WHILE LOOPING THROUGH LIST OF DATA SOURCES
    CREATE A NEW DATA LINK FOR THE CURRENT DATA SOURCE
    UPDATE THE WORKBOOK DEFINITION TO USE THE NEW DATA LINK
    RUN THE WORKBOOK
    CHECK THAT THE WORKBOOK IS DONE

I ended up creating two functions for steps that got repeated a lot, which were 1) updating JSON files that contain the definitions of Data Links and Workbooks, and 2) checking on job status.
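The JSON-update helper boils down to plain string substitution. Here is a minimal sketch of the idea in Python; the placeholder tag names and the sample JSON are made up for illustration, not taken from an actual Datameer template:

```python
# Minimal sketch of the template-substitution step used for the Data Link
# and Workbook JSON. <TABLE_NAME> / <LINK_NAME> are assumed placeholder tags.
def fill_template(template_text, substitutions):
    # Replace each placeholder tag with its per-table value.
    for tag, value in substitutions.items():
        template_text = template_text.replace(tag, value)
    return template_text

datalink_json = '{"name": "<LINK_NAME>", "table": "<TABLE_NAME>"}'
filled = fill_template(datalink_json, {"<LINK_NAME>": "orders_link",
                                       "<TABLE_NAME>": "orders"})
# filled == '{"name": "orders_link", "table": "orders"}'
```

The full scripts below do the same thing against temporary copies of the template files, so the originals are never altered.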

Notes on doing a REST API call:
In Bash: I used the curl command to do the REST call. I also used backticks to place the output of the curl command into a variable. The line of code looks like this:

# Note the use of variables to store username, password, host, port, etc.
# Use of the correct quotes is important.
RET_VAL=`curl -u $USER:$PASSWORD -X GET "http://$HOST:$PORT/rest/job-configuration/job-status/$CONFIGURATION_ID"`
# Unfortunately, I chose a pretty ugly method to parse the variable RET_VAL. A combo of grep, awk and tr.
STATUS=`echo "$RET_VAL" | grep "jobStatus" | awk -F: '{print $2}' | tr -d '[:space:]' | tr -d \"`

In Python: I used the requests module to invoke Datameer’s REST API.

# The python implementation is cleaner. Run the request, using the auth parameter and the HTTPBasicAuth function to handle authentication.
ret_val = requests.get("http://" + host + ":" + port + "/rest/job-configuration/job-status/" + configuration_id, auth=HTTPBasicAuth(user, password))
# Then convert the return value to JSON.
params = ret_val.json()
# And finally pull out the parameter that you want.
status = params['jobStatus']

Below you can find the source code in both Bash and Python.

The Bash example.


#!/bin/bash

# Example script for creating several datalinks on similar tables and then creating
# workbooks to run on those data links.

# Script assumes you have one working data link and one working workbook, and have
# modified the underlying JSON to include some tags that sed can look for and replace.

# Datameer environment parameters
USER=your_user
PASSWORD=your_password
HOST=localhost
PORT=7996

# Script parameters
# Parameter 1: DATALINK_TEMPLATE_FILE - Contains datalink JSON to be modified. Script does not alter
#              this file. Instead, it creates a temporary file and runs that.
# Parameter 2: WORKBOOK_TEMPLATE_FILE - Contains workbook JSON to be modified. Script does not alter
#              this file. Instead, it creates a temporary file and runs that.
# Parameter 3: LIST_OF_TABLES - Contains a list of the target tables to be processed.

# Assign parameters to global variables.
DATALINK_TEMPLATE_FILE=$1
WORKBOOK_TEMPLATE_FILE=$2
LIST_OF_TABLES=$3

# Temporary file that contains edits when they happen.
TEMP_FILE=zzz_temp_file.json

# Sub Routine: do_change. Runs sed utility to do update.
# Positional arguments
# 1. Value to look for.
# 2. New value to put in.
do_change () {
	# Get positional arguments for function.
	local CUR_VAL=$1
	local NEW_VAL=$2

	echo "Updating value $CUR_VAL with $NEW_VAL..."
	# -i.bak works with GNU sed; on macOS/BSD sed, use: sed -i bak -e ...
	sed -i.bak -e "s/$CUR_VAL/$NEW_VAL/g" $TEMP_FILE
}

# Sub Routine: check_job_status. Stay in loop until job completes.
# Positional arguments
# 1. Job ID.
check_job_status () {
	# Get positional arguments for function.
	local CONFIGURATION_ID=$1

	echo "Checking job status for $CONFIGURATION_ID"
	echo "'http://$HOST:$PORT/rest/job-configuration/job-status/$CONFIGURATION_ID'"

    # Run loop until status shows job is done.
    while true; do
	    RET_VAL=`curl -u $USER:$PASSWORD -X GET "http://$HOST:$PORT/rest/job-configuration/job-status/$CONFIGURATION_ID"`
    	STATUS=`echo "$RET_VAL" | grep "jobStatus" | awk -F: '{print $2}' | tr -d '[:space:]' | tr -d \"`
    	echo $STATUS

    	sleep .5

    	if [[ $STATUS = "COMPLETED" || $STATUS = "ERROR" || $STATUS = "COMPLETED_WITH_WARNINGS"  || $STATUS = "ABORTING" || $STATUS = "_NONE" ]]
    	then
    		break
    	fi
    done
}

# Main loop for running through all tables in the list.
while read table_name; do

	# Edit Datalink JSON
	cp $DATALINK_TEMPLATE_FILE $TEMP_FILE
	do_change "<TABLE_NAME>" $table_name
	do_change "<LINK_NAME>" $table_name

	# Issue command to create new Datalink. Capture result in variable.
    RET_VAL=`curl -u $USER:$PASSWORD -X POST -d @$TEMP_FILE "http://${HOST}:${PORT}/rest/import-job"`
	# Display result to screen.
    echo "$RET_VAL"
    # Parse result to get configuration id.
    CONFIGURATION_ID=`echo "$RET_VAL" | grep "configuration" | awk -F: '{print $2}' | tr -d '[:space:]'`

	#Display result to screen.
	echo $CONFIGURATION_ID

	# Now run the Datalink to get sample data.
	curl -u $USER:$PASSWORD -X POST "http://${HOST}:${PORT}/rest/job-execution?configuration=${CONFIGURATION_ID}"	

	# Pause until the data link finishes.
	check_job_status $CONFIGURATION_ID	

	# Edit Workbook JSON to include the new datalink
	cp $WORKBOOK_TEMPLATE_FILE $TEMP_FILE
	do_change "<LINK_NAME>" $table_name

	# Issue command to create new workbook
    RET_VAL=`curl -u $USER:$PASSWORD -X POST -d @$TEMP_FILE "http://$HOST:$PORT/rest/workbook"`
	# Display result to screen.
    echo "$RET_VAL"
    # Parse result to get configuration id.
    CONFIGURATION_ID=`echo "$RET_VAL" | grep "configuration" | awk -F: '{print $2}' | tr -d '[:space:]'`

	# Now run the new workbook.
	#Display result to screen.
	echo $CONFIGURATION_ID
	# Now run the Workbook to get sample data.
	curl -u $USER:$PASSWORD -X POST "http://${HOST}:${PORT}/rest/job-execution?configuration=${CONFIGURATION_ID}"

	# Pause until the workbook finishes.
	check_job_status $CONFIGURATION_ID	

done < $LIST_OF_TABLES

And the Python example.

# Example script for creating several datalinks on similar tables and then creating
# workbooks to run on those data links.

# Script assumes you have one working data link and one working workbook, and have
# modified the underlying JSON to include some tags that the script can look for and replace.

import sys
import shutil
import requests
from requests.auth import HTTPBasicAuth
import json
import time

# Datameer environment parameters
user = "your_user"
password = "your_password"
host = "localhost"
port = "7996"

# Get input parameters
datalink_template_file = sys.argv[1]
workbook_template_file = sys.argv[2]
list_of_tables = sys.argv[3]

# Temp file to use for creating dynamic JSON to create Datameer objects.
temp_file = "./zzz_temp_file.json"
temp_file2 = "./zzz_temp_file.2.json"

# Function for updating JSON file contents
# Open one file for read, another for write. Make changes. At end, swap files.
def do_change(cur_val, new_val, file_name, file_name2):
    with open(file_name, "rt") as f_in:
        with open(file_name2, "wt") as f_out:
            for line in f_in:
                f_out.write(line.replace(cur_val, new_val))
    # The with blocks close both files automatically; copy the edited file back over the original.
    shutil.copy(file_name2, file_name)

def check_job_status(configuration_id):
    print("Checking job status for " + configuration_id)
    print("'http://" + host + ":" + port + "/rest/job-configuration/job-status/" + configuration_id + "'")

    # Run loop until status shows job is done.
    while True:
        ret_val = requests.get("http://" + host + ":" + port + "/rest/job-configuration/job-status/" + configuration_id, auth=HTTPBasicAuth(user, password))
        params = ret_val.json()
        print(params)
        status = params['jobStatus']
        print(status)
        time.sleep(1)

        # Check status and break out of loop when the job is done.
        if status in ("COMPLETED", "ERROR", "COMPLETED_WITH_WARNINGS", "ABORTING", "_NONE"):
            break

# Main loop iterating through list of tables.
with open(list_of_tables,'r') as table_file:

    for table_name in table_file:

        table_name = table_name.rstrip()
        if not table_name: continue

        print ("Copying " + datalink_template_file + " for " + table_name)
        # Copy the template file to a temporary working file.
        shutil.copy(datalink_template_file, temp_file)

        do_change ("<TABLE_NAME>", table_name, temp_file, temp_file2)
        do_change ("<LINK_NAME>", table_name, temp_file, temp_file2)

        with open(temp_file, 'r') as content_file:
            payload = content_file.read()

        # Add new datalink to Datameer.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/import-job", auth=HTTPBasicAuth(user, password), data=payload)
        print (ret_val.json())
        params = ret_val.json()
        configuration_id = str(params['configuration-id'])
        print (configuration_id)

        # Now start the datalink to get the sample data.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/job-execution?configuration=" + configuration_id, auth=HTTPBasicAuth(user, password))

        # Pause until the data link finishes.
        check_job_status (configuration_id)

        # Edit Workbook JSON to include the new datalink
        shutil.copy(workbook_template_file, temp_file)
        do_change ("<LINK_NAME>", table_name, temp_file, temp_file2)
        with open(temp_file, 'r') as content_file:
            payload = content_file.read()

        # Issue command to create new workbook
        ret_val = requests.post("http://" + host + ":" + port + "/rest/workbook", auth=HTTPBasicAuth(user, password), data=payload)        

        # Display result to screen.
        print(ret_val.json())
        # Parse result to get configuration id.
        params = ret_val.json()
        configuration_id = str(params['configuration-id'])
        print (configuration_id)	

        # Now run the Workbook to do calculation.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/job-execution?configuration=" + configuration_id, auth=HTTPBasicAuth(user, password))

        # Pause until the workbook finishes.
        check_job_status (configuration_id)


Rebuilding Yum Database


When starting Hadoop components on the Linux virtual machines on my laptop, I periodically get failures at startup. The culprit? A corrupt yum database. I’ve been running Hortonworks 2.0 recently and have used the Ambari GUI to start services. The failure I encounter looks like this:

notice: Finished catalog run in 0.65 seconds
err: /Stage[1]/Hdp/Hdp::Lzo::Package[64]/Hdp::Package[lzo 64]/Hdp::Package::Process_pkg[lzo 64]/Package[lzo]/ensure: change from absent to present failed: Execution of '/usr/bin/yum -d 0 -e 0 -y install lzo' returned 1: rpmdb: Thread/process 3570/140380093466368 failed: Thread died in Berkeley DB library
error: db3 error(-30974) from dbenv->failchk: DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db3 -  (-30974)
error: cannot open Packages database in /var/lib/rpm
CRITICAL:yum.main:

Error: rpmdb open failed

The following series of commands straighten out the situation I typically encounter.

# rm -f /var/lib/rpm/__db*
# db_verify /var/lib/rpm/Packages
# rpm --rebuilddb
# yum clean all

If you are hitting the DB_RUNRECOVERY error shown above, the command yum clean all will not work until the first three commands have been run.

vmware fusion fixed ip stuff


As a presales engineer at a small big data company with our own database that runs on many platforms, I now use VMWare Fusion all the time.

Here’s some of the grunt work I do when creating new VMs for which I want a fixed IP address and a changed hostname.

If I make the new VM and it’s Linux, I may use the VMWare rapid installation feature. Then I want to change the root password. That just requires a restart. During restart, hit any key on the keyboard to bring up the Linux GRUB screen. Then hit “e” on the keyboard to edit the start parameters. Use the down arrow to get to the line starting with “kernel”, and hit “e” again to edit that line. Delete the last two words, “rhgb quiet”, and replace them with “1”. Then hit Enter to preserve your change, and then “b” to boot. You will start in single-user mode as root. From there, you can use the passwd command to provide a new password for user root.

Then to change the hostname and set up a fixed IP, I use these steps on CentOS Linux, after becoming root of course.
1. Use the hostname command to change the host name.
[root@localhost ~]# hostname yournewhostname.local

2. Edit /etc/sysconfig/network to include the new hostname.
NETWORKING=yes
HOSTNAME=yournewhostname.local

3. Back on my Mac, I figure out what the static IP address will be. With VMWare Fusion 5 (and apparently 4 or higher) this involves editing the following file:
/Library/Preferences/VMware Fusion/vmnet8/dhcpd.conf

Here’s a great write up on what to do.

4. Edit /etc/hosts to include the new hostname. Also include the static ip you chose in step 3.

5. I’ve been using CentOS for my VMs. I’ve been using the GUI “Network Connections” tool under the System – Preferences menu to create the static IP address. On the first screen, select your ethernet card. Mine shows up as eth0 on the CentOS Linux VMs I’ve created. Then click the Edit button.

On the resulting screen, I click on the IPv4 tab.

Then I switch the method to Manual. I click Add and make the address my new static IP that I chose in step 3 above. Netmask is 255.255.255.0. The gateway is the IP address of your Mac. I also make the DNS server the IP address for my Mac.
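If you prefer to skip the GUI, the same settings can go straight into the interface configuration file. A sketch for CentOS follows; the addresses are examples and would need to match the subnet you found in the vmnet8 dhcpd.conf in step 3:

```
# /etc/sysconfig/network-scripts/ifcfg-eth0  (addresses are examples)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.78.130
NETMASK=255.255.255.0
GATEWAY=192.168.78.1
DNS1=192.168.78.1
```

A service network restart afterwards picks up the change.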

Privileges Needed to Backup Databases


Creating backups is one of the most vital tasks in maintaining computer infrastructure. System administrators and Database Administrators in particular get saddled with this unglamorous task. Backing up an operating system (OS) generally does not require any special manipulation of the OS prior to sending OS files to a backup target such as tape, disk or deduplicated disk targets like Data Domain. A system administrator can do things the old fashioned way, such as generating a tar file and sending it to the backup target. Or he or she can deploy sophisticated backup software such as EMC’s Avamar to administrate and automate the backup process across hundreds or thousands of computers.

In any event, backing up the OS generally requires the root or Administrator level privilege. And this is rarely a problem because the System Administration team runs the backups, and those users typically have that privilege anyway.

Backing up databases presents a variety of challenges. One is that for many database systems the Database Administrator (DBA) must place the database in a special mode prior to creating the backup. Another is that on many popular database systems, the DBA must backup files that store the record of transactions since these are used to permit recovery to any point in time. Such files are known as transaction logs on SQL Server, Sybase, Postgres and DB2. They are called archive logs in Oracle, although older Oracle documents called them offline redo logs.

And another challenge is that the privilege needed to back up a database varies among database vendors. If the DBA team is running the backups, the level of privilege needed within the database is not a problem. But if the database backups are run by another team, then the DBA team may be reluctant to give out a high level of privilege just to run backups. What to do? To date, there is not much flexibility in the privilege needed to do the database backup.

Here’s a breakdown, by database, of the privileges needed by an internal database user running the database backup, along with other considerations.

ORACLE
Privileges needed: The database user executing the RMAN backup needs the SYSDBA role. The users SYS and SYSTEM have this role. Or a new user can be created, but it must have SYSDBA.
Other considerations: If using an RMAN catalog, the database user that connects to the RMAN catalog does NOT have to have SYSDBA. It does need the RECOVERY_CATALOG_OWNER role. In other words, one user can run the backup and another user can record the activity in the RMAN catalog. And what about Oracle 12c? It’s possible that the privilege needed to back up the new pluggable database model will no longer need SYSDBA. I’ll update this post when 12c gets released.

SQL SERVER
Privileges needed: For an individual database, a SQL Server database user with the database-level role db_backupoperator can perform a backup of that database. NOTE that this privilege must be granted within each new database that gets added to the instance. The server-level role sysadmin will allow a user to back up any database, even new ones.
Other considerations: While db_backupoperator is much lower privileged than sysadmin, only sysadmin ensures that new databases will get backed up. Using db_backupoperator requires explicitly granting the privilege on each new database.

SYBASE ASE
Privileges needed: The lowest level of privilege needed is the Operator role. Operator will permit backup and restore on any database in the instance. The System Administrator (SA) role also provides this, but it is the highest-level role. Finally, the database owner can do backup and restore, but this privilege must be granted individually for each database.
Other considerations: The Operator role will work with new databases, so it is a good choice.

DB2
Privileges needed: You must have SYSADM, SYSCTRL, or SYSMAINT authority to use the backup and recover utilities in DB2.
Other considerations: SYSADM is the most powerful role of the three. SYSCTRL controls operations that affect system resources, but cannot update table data. SYSMAINT can perform maintenance operations on all individual databases within an instance, but like SYSCTRL, cannot update table data. SYSCTRL and SYSMAINT both offer a lower-privileged account than SYSADM.
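To make the lower-privilege options concrete, here is roughly what granting them looks like; the user and database names are made up for illustration:

```sql
-- SQL Server: backup-only rights, granted per database (names are examples)
USE SalesDB;
CREATE USER backup_svc FOR LOGIN backup_svc;
EXEC sp_addrolemember 'db_backupoperator', 'backup_svc';

-- Oracle: a dedicated backup user for RMAN
CREATE USER rman_backup IDENTIFIED BY some_password;
GRANT SYSDBA TO rman_backup;
```

Remember that on SQL Server the db_backupoperator grant has to be repeated inside every new database added to the instance.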

NOTE: I plan to add to the above table as I get more information on databases.

Creating an RMAN Recovery Catalog


Need to create an RMAN recovery catalog? Check the Creating a Recovery Catalog section of the Backup and Recovery User Guide document.

Steps listed there are:
1. Create a tablespace for the catalog
2. Create a user that will own the schema and assign to it the newly created tablespace.
3. Grant the RECOVERY_CATALOG_OWNER privilege to the user, and if 11g or higher, grant CONNECT as well.

SQL> create tablespace TOOLS datafile 'C:\ORADATA\RCATDB\TOOLS_01.DBF' size 100M;

Tablespace created.

SQL> create user RCAT identified by RCAT default tablespace TOOLS temporary tablespace TEMP quota unlimited on TOOLS;

User created.

SQL> grant recovery_catalog_owner to RCAT;

Grant succeeded.

SQL> grant connect to RCAT;

Grant succeeded.

4. Using the RMAN utility, login with the CONNECT CATALOG syntax as the recovery catalog schema owner and run CREATE CATALOG. This will build the schema. You can specify a tablespace if desired.


C:\Users\emcadmin> rman

Recovery Manager: Release 11.2.0.1.0 - Production on Wed Jan 2 18:30:33 2013

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

RMAN> connect catalog RCAT/RCAT

connected to recovery catalog database

RMAN> create catalog tablespace tools;

recovery catalog created

At this point one can begin registering databases to be backed up within the catalog.


RMAN> connect target;

RMAN> register database; 

SQL for checking RMAN catalog or control file for backup information


Do you need a quick query to check the local control file to see if backups succeeded? Use the below if there is no RMAN catalog involved.

/*
Query to use when storing metadata in the control file.
SID column will be 0 unless working with Oracle RAC.
*/
select sid, object_type, status, 
round((end_time - start_time) * 24 * 60, 2) duration_minutes,  
to_char(start_time, 'mm/dd/yyyy hh:mi:ss') start_time, 
to_char(end_time, 'mm/dd/yyyy hh:mi:ss') end_time,
round((input_bytes/(1024*1024*1024)),2) input_gb, 
round((output_bytes/(1024*1024*1024)),2) output_gb
from v$rman_status
where operation = 'BACKUP';

If you are using the RMAN catalog, then you can run this query instead.

/*
Query to use when storing metadata in the RMAN catalog.
*/
select db_name, object_type, status, 
round((end_time - start_time) * 24 * 60, 2) duration_minutes,  
to_char(start_time, 'mm/dd/yyyy hh:mi:ss') start_time, 
to_char(end_time, 'mm/dd/yyyy hh:mi:ss') end_time,
round((input_bytes/(1024*1024*1024)),2) input_gb, 
round((output_bytes/(1024*1024*1024)),2) output_gb
from rc_rman_status
where operation = 'BACKUP';
/*
ADD THE FOLLOWING TO LIMIT THE ROWS RETURNED BY DATE:
where end_time > (SYSDATE - 31)
*/

quick peek at postgres post install on ubuntu


I’m checking out how postgres works.

I installed it to 32-bit Ubuntu 12.04. Googling led to multiple places for the appropriate apt-get commands.

The install created a user, postgres, that runs the postgres binaries. However, root owns the binary files.

A default instance got created. I haven’t learned yet if people often run multiple postgres instances per host.

The following processes run with the default installation:
/usr/lib/postgresql/9.1/bin/postgres
postgres: writer process
postgres: WAL process
postgres: autovacuum launcher
postgres: stats collector process

The first process is the main postgres process, and it was launched with the -D parameter pointing to a specific data directory and the -c parameter pointing to the full path of the postgresql.conf file.

The writer process, I’m surmising, must write to data files, and the WAL process, I’ve read elsewhere, handles the Write-Ahead Log, similar to the redo log writer in Oracle. The autovacuum launcher governs the ability to automatically run the VACUUM command, which Postgres needs periodically. And I’m sure the stats collector updates query-optimization stats, but I’ll have to check.

There’s a command psql that is the equivalent of sqlplus. I’ll explore psql in a follow-up post.
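As a quick taste, here is roughly what a first session looks like, assuming the default postgres OS user created by the install; \l lists databases, \dt lists tables in the current database, and \q quits:

```
$ sudo -u postgres psql
psql (9.1)
Type "help" for help.

postgres=# \l
postgres=# \dt
postgres=# \q
```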

Documentation for postgres can be found at postgresql.org.

Having worked as a SQL Server and Oracle DBA, I know that keeping track of database storage is important. Documentation for those two products describes early on how each system places all objects into datafiles. A datafile can contain tables, indexes, stored procedures, views and everything else.

Postgres, on the other hand, relegates discussion of physical storage to a location fairly deep in the documentation. Each table gets its own datafile. A master directory tree contains all the objects in the postgres database, with most objects getting their own separate file. And postgres dictates the directory structure, although perhaps in more advanced deployments users can control some aspects. The filenames contain a number that is automatically generated by postgres. My instance installed to /var/lib/postgresql/9.1/main. There are multiple sub-directories below that.
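You can map a table to its file from inside the database; pg_relation_filepath is a built-in function available in the 9.x releases ('your_table' below is a placeholder name):

```sql
-- Returns the table's datafile path relative to the data directory,
-- e.g. something like base/16384/16397
SELECT pg_relation_filepath('your_table');
```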

Done writing for now, but I’m going to create some tables, bang around with psql and try out the GUI admin tool pgAdmin III.
