Automate running the same Datameer Workbook on different data sets

Datameer is an extremely user-friendly software platform for self-service, end-to-end big data discovery and analytics. You can do just about anything in the realm of descriptive analytics using the Datameer GUI, and you can automate a complex series of steps just by pointing and clicking with software that is as easy to use as Microsoft Office.

But there are some activities that require the Datameer REST API. A prospect recently asked me, “How can I run the same Datameer workbook on 50 different data sets that share the same column layout, without doing the same thing 50 times manually in the GUI?” The answer is a script that invokes the Datameer REST API. The beauty of a REST API is that you can use the scripting language of your choice to tackle the problem. I ended up creating sample scripts in both Bash and Python to illustrate that you can pick your language.

The first order of business is defining the problem. In this case, we needed to iterate over a set of data sources that all share the same column layout. For each data source, the script has to create a new data source in Datameer, point the workbook at it, and then execute the newly updated workbook. I also chose to wait for each workbook job to complete before starting the next one, rather than moving on to the next cycle immediately. The data source type I used in Datameer is called a Data Link. A Data Link extracts a sample data set from the source without duplicating the entire source set. If you want the entire data set available in Datameer’s HDFS folder rather than just a sample, use an Import Job instead.

Pseudo code for my example looks like this:

WHILE LOOPING THROUGH LIST OF DATA SOURCES
    CREATE A NEW DATA LINK FOR THE CURRENT DATA SOURCE
    UPDATE THE WORKBOOK DEFINITION TO USE THE NEW DATA LINK
    RUN THE WORKBOOK
    CHECK THAT THE WORKBOOK IS DONE

I ended up creating two helper functions for the steps that repeat: 1) updating the JSON files that contain the definitions of Data Links and Workbooks, and 2) checking job status.

Notes on doing a REST API call:
In Bash: I used the curl command to make the REST call, and backticks to capture the output of curl into a variable. The code looks like this:

# Note the use of variables to store username, password, host, port, etc.
# Use of the correct quotes is important.
RET_VAL=`curl -u $USER:$PASSWORD -X GET "http://$HOST:$PORT/rest/job-configuration/job-status/$CONFIGURATION_ID"`
# Unfortunately, I chose a pretty ugly method to parse the variable RET_VAL. A combo of grep, awk and tr.
STATUS=`echo "$RET_VAL" | grep "jobStatus" | awk -F: '{print $2}' | tr -d '[:space:]' | tr -d \"`
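
If the jq utility happens to be installed, the parsing gets much cleaner. A hedged alternative (the scripts below stick with grep/awk/tr to avoid the extra dependency):

# Hypothetical alternative: assumes the jq JSON parser is installed.
STATUS=`curl -s -u $USER:$PASSWORD "http://$HOST:$PORT/rest/job-configuration/job-status/$CONFIGURATION_ID" | jq -r '.jobStatus'`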

In Python: I used a module called requests to invoke Datameer’s REST API.

# The python implementation is cleaner. Run the request, using the auth parameter and the HTTPBasicAuth function to handle authentication.
ret_val = requests.get("http://" + host + ":" + port + "/rest/job-configuration/job-status/" + configuration_id, auth=HTTPBasicAuth(user, password))
# Then convert the return value to JSON.
params = ret_val.json()
# And finally pull out the parameter that you want.
status = params['jobStatus']

Below you can find the source code in both Bash and Python.

The Bash example.


#!/bin/bash

# Example script for creating several datalinks on similar tables and then creating
# workbooks to run on those data links.

# Script assumes you have one working data link and one working workbook, and have
# modified the underlying JSON to include some tags that sed can look for and replace.

# Datameer environment parameters
USER=your_user
PASSWORD=your_password
HOST=localhost
PORT=7996

# Script parameters
# Parameter 1: DATALINK_TEMPLATE_FILE - Contains datalink JSON to be modified. Script does not alter
#              this file. Instead, it copies it to a temporary file and edits that.
# Parameter 2: WORKBOOK_TEMPLATE_FILE - Contains workbook JSON to be modified. Script does not alter
#              this file. Instead, it copies it to a temporary file and edits that.
# Parameter 3: LIST_OF_TABLES - Contains a list of the target tables to be processed.

# Assign parameters to global variables.
DATALINK_TEMPLATE_FILE=$1
WORKBOOK_TEMPLATE_FILE=$2
LIST_OF_TABLES=$3

# Temporary file that contains edits when they happen.
TEMP_FILE=zzz_temp_file.json

# Sub Routine: do_change. Runs sed utility to do update.
# Positional arguments
# 1. Value to look for.
# 2. New value to put in.
do_change () {
# Get positional arguments for function.
CUR_VAL=$1
NEW_VAL=$2

	echo "Updating value $CUR_VAL with $NEW_VAL..."
	sed -i.bak -e "s/$CUR_VAL/$NEW_VAL/g" $TEMP_FILE
}

# Sub Routine: check_job_status. Stay in loop until job completes.
# Positional arguments
# 1. Job ID.
check_job_status () {
# Get positional arguments for function.
CONFIGURATION_ID=$1

	echo "Checking job status for $JOB_ID"
	echo "'http://localhost:7996/rest/job-configuration/job-status/$JOB_ID'"

    # Run loop until status shows job is done.
    while true; do
	    RET_VAL=`curl -u $USER:$PASSWORD -X GET "http://$HOST:$PORT/rest/job-configuration/job-status/$CONFIGURATION_ID"`
    	STATUS=`echo "$RET_VAL" | grep "jobStatus" | awk -F: '{print $2}' | tr -d '[:space:]' | tr -d \"`
    	echo $STATUS

    	sleep .5

    	if [[ $STATUS = "COMPLETED" || $STATUS = "ERROR" || $STATUS = "COMPLETED_WITH_WARNINGS"  || $STATUS = "ABORTING" || $STATUS = "_NONE" ]]
    	then
    		break
    	fi
    done
}

# Main loop for running through all tables in the list.
while read table_name; do

	# Edit Datalink JSON
	cp $DATALINK_TEMPLATE_FILE $TEMP_FILE
	do_change "
<TABLE>" $table_name $DATALINK_TEMPLATE_FILE
	do_change "" $table_name $DATALINK_TEMPLATE_FILE

	# Issue command to create new Datalink. Capture result in variable.
    RET_VAL=`curl -u $USER:$PASSWORD -X POST -d @$TEMP_FILE "http://${HOST}:${PORT}/rest/import-job"`
	# Display result to screen.
    echo "$RET_VAL"
    # Parse result to get configuration id.
    CONFIGURATION_ID=`echo "$RET_VAL" | grep "configuration" | awk -F: '{print $2}' | tr -d '[:space:]'`

	#Display result to screen.
	echo $CONFIGURATION_ID

	# Now run the Datalink to get sample data.
	curl -u $USER:$PASSWORD -X POST "http://${HOST}:${PORT}/rest/job-execution?configuration=${CONFIGURATION_ID}"	

	# Pause until the data link finishes.
	check_job_status $CONFIGURATION_ID	

	# Edit Workbook JSON to include the new datalink
	cp $WORKBOOK_TEMPLATE_FILE $TEMP_FILE
	do_change "" $table_name $WORKBOOK_TEMPLATE_FILE	

	# Issue command to create new workbook
    RET_VAL=`curl -u $USER:$PASSWORD -X POST -d @$TEMP_FILE "http://$HOST:$PORT/rest/workbook"`
	# Display result to screen.
    echo "$RET_VAL"
    # Parse result to get configuration id.
    CONFIGURATION_ID=`echo "$RET_VAL" | grep "configuration" | awk -F: '{print $2}' | tr -d '[:space:]'`

	# Display result to screen.
	echo $CONFIGURATION_ID

	# Now run the new workbook to do the calculation.
	curl -u $USER:$PASSWORD -X POST "http://${HOST}:${PORT}/rest/job-execution?configuration=${CONFIGURATION_ID}"

	# Pause until the workbook finishes.
	check_job_status $CONFIGURATION_ID	

done < $LIST_OF_TABLES
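
To run the Bash version, pass the two template files and the table list as positional arguments. The script name here is just an example:

./run_workbooks.sh datalink_template.json workbook_template.json table_list.txt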

And the Python example.

# Example script for creating several datalinks on similar tables and then creating
# workbooks to run on those data links.

# Script assumes you have one working data link and one working workbook, and have
# modified the underlying JSON to include some tags that the script can find and replace.

import sys
import shutil
import requests
from requests.auth import HTTPBasicAuth
import json
import time

# Datameer environment parameters
user = "your_user"
password = "your_password"
host = "localhost"
port = "7996"

# Get input parameters
datalink_template_file = sys.argv[1]
workbook_template_file = sys.argv[2]
list_of_tables = sys.argv[3]

# Temp file to use for creating dynamic JSON to create Datameer objects.
temp_file = "./zzz_temp_file.json"
temp_file2 = "./zzz_temp_file.2.json"

# Function for updating JSON file contents.
# Open one file for read, another for write, replace the tag, then swap the files.
def do_change(cur_val, new_val, file_name, file_name2):
    with open(file_name, "rt") as f_in:
        with open(file_name2, "wt") as f_out:
            for line in f_in:
                f_out.write(line.replace(cur_val, new_val))
    # The with blocks close both files; copy the edited file back over the original.
    shutil.copy(file_name2, file_name)

def check_job_status(configuration_id):
    print("Checking job status for " + configuration_id)
    print("'http://" + host + ":" + port + "/rest/job-configuration/job-status/" + configuration_id + "'")

    # Run loop until status shows the job is done.
    while True:
        ret_val = requests.get("http://" + host + ":" + port + "/rest/job-configuration/job-status/" + configuration_id, auth=HTTPBasicAuth(user, password))
        params = ret_val.json()
        print(params)
        status = params['jobStatus']
        print(status)
        time.sleep(1)

        # Check status and break out of the loop when the job is done.
        if status in ("COMPLETED", "ERROR", "COMPLETED_WITH_WARNINGS", "ABORTING", "_NONE"):
            break

# Main loop iterating through list of tables.
with open(list_of_tables,'r') as table_file:

    for table_name in table_file:

        table_name = table_name.rstrip()
        if not table_name: continue

        print ("Copying " + datalink_template_file + " for " + table_name)
        # Copy the template file to a temporary working file.
        shutil.copy(datalink_template_file, temp_file)

        do_change ("<TABLE_NAME>", table_name, temp_file, temp_file2)
        do_change ("<LINK_NAME>", table_name, temp_file, temp_file2)

        with open(temp_file, 'r') as content_file:
            payload = content_file.read()

        # Add new datalink to Datameer.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/import-job", auth=HTTPBasicAuth(user, password), data=payload)
        print (ret_val.json())
        params = ret_val.json()
        configuration_id = str(params['configuration-id'])
        print (configuration_id)

        # Now start the datalink to get the sample data.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/job-execution?configuration=" + configuration_id, auth=HTTPBasicAuth(user, password))

        # Pause until the data link finishes.
        check_job_status (configuration_id)

        # Edit Workbook JSON to include the new datalink
        shutil.copy(workbook_template_file, temp_file)
        do_change ("<LINK_NAME>", table_name, temp_file, temp_file2)
        with open(temp_file, 'r') as content_file:
            payload = content_file.read()

        # Issue command to create new workbook
        ret_val = requests.post("http://" + host + ":" + port + "/rest/workbook", auth=HTTPBasicAuth(user, password), data=payload)        

        # Display result to screen.
        print(ret_val)
        # Parse result to get configuration id.
        params = ret_val.json()
        configuration_id = str(params['configuration-id'])
        print (configuration_id)	

        # Now run the Workbook to do calculation.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/job-execution?configuration=" + configuration_id, auth=HTTPBasicAuth(user, password))

        # Pause until the workbook finishes.
        check_job_status (configuration_id)
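
The Python version takes the same three positional arguments. Again, the file names are just examples:

python run_workbooks.py datalink_template.json workbook_template.json table_list.txt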


Get Oracle Client Information with PowerShell

I spent a couple of hours this morning coming up with a way to see which of a large number of Windows hosts have the Oracle client installed. In particular, I wanted the path to the Oracle home on each machine. We have some third-party agent-based tools that gather data about Windows machines, but licensing prohibits installing them everywhere.

Eventually I decided to create a PowerShell script that iterates through a list of Windows hosts and outputs the value of every ORACLE_HOME entry in the registry, plus the machine's PATH variable. A number of machines are likely to have more than one Oracle client, or to have stale ORACLE_HOME values left over from clients that have been uninstalled. Getting the PATH value lets me see which client was installed most recently, since the most recently installed client should be foremost in the PATH.

I’ve posted the PowerShell script below. It requires an input file and creates an output file that can be opened in Excel. It illustrates FOREACH loops, invoking the REG QUERY command against remote machines, IF logic, reading and writing files, and some grep-style filtering.

# Written with PowerShell 2.0. Not sure if it works with 1.0.

# File containing list of hosts.
$host_inputfile = "host_list.txt"

# Name of output data file and of log file.
$ofile = "ohome_values.txt"
$olog = "ohome_run.log"

# Initialize counters for hosts and Oracle Homes.
$total_hosts = 0
# Total homes found
$total_homes = 0

# Update log file to say that processing is starting.
echo "Starting new run of get_oracle_client." | out-file $olog -append
$start_time = get-date
echo $start_time | out-file $olog -append

# Start output data file by putting a header row in.
echo "HOSTNAME, KEY_NAME, KEY_TYPE, KEY_VALUE" | out-file $ofile

# Read into a variable the file containing the list of hosts.
$host_list = get-content $host_inputfile

# Iterate through each host in the list.
foreach ($HOST_NAME in $host_list) {

	# Update the user on progress.
	echo "Working on $HOST_NAME..."

	# Run a command to get a list of all ORACLE_HOME values from the
	# registry. Several strings need to be excluded, such as
	# ORACLE_HOME_NAME and agent10g.

	$OH_RAW = reg query \\$HOST_NAME\HKLM\Software\Oracle /s | select-string "ORACLE_HOME" 

	$OH = echo $OH_RAW | sort | get-unique | select-string -notmatch "ORACLE_HOME_KEY", "ORACLE_HOME_NAME", "agent10g"

	# Write Oracle Home values to file. Include a column for the host name.
	# Iterate through each line returned from the remote machine.
	foreach ($line in $OH) {

		# Initialize output var
	  $out_var = "$HOST_NAME"

		$split_line = -split $line

		foreach ($word in $split_line) {
			$out_var="$out_var,$word"
		}

		# If there was nothing to be processed, then don't write to file.
		if ($out_var -eq "$HOST_NAME,") {
			continue
		}

		# Increment Oracle Homes counter variable and write values to file.
		$total_homes = $total_homes + 1
		echo $out_var | out-file $ofile -append

	}

	# Key for getting PATH:
	# HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Environment

	$path_var = reg query "\\$HOST_NAME\HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment"
	$path_var = echo $path_var | select-string "Path"
	$path_var = echo $path_var | select-string -notmatch "PATHEXT", "TRACE_FORMAT_SEARCH_PATH", "PSModulePath", "Os2LibPath"

	# Write path value to file. Include a column for the host name.
	foreach ($line in $path_var) {

		# Initialize output var
	  $out_var = "$HOST_NAME"

		$split_line = -split $line

		foreach ($word in $split_line) {
			$out_var="$out_var,$word"
		}

		# If there was nothing to be processed, then don't write to file.
		if ($out_var -eq "$HOST_NAME,") {
			continue
		}		

		# Write values to file.
		echo $out_var | out-file $ofile -append
	}

	# Increment hosts counter variable.
	$total_hosts = $total_hosts + 1

}

$end_time = get-date

# Update screen.
echo "Start time was: $start_time."
echo "End time was: $end_time."
echo "$total_hosts hosts processed."
echo "$total_homes Oracle Homes found." 

# Append to log file.
echo "Ending run of get_oracle_client." | out-file $olog -append
echo "$total_hosts hosts processed." | out-file $olog -append
echo "$total_homes Oracle Homes found." | out-file $olog -append
echo $end_time | out-file $olog -append

Using PowerShell for Oracle DBA scripts

I needed to deploy an RMAN backup script on a Windows machine and couldn’t use OEM in the target environment. So I decided to try out PowerShell rather than BAT and VBScript. PowerShell proved to be a handy, easy-to-use scripting environment.

While PowerShell is object-based and tightly integrated with both the Windows OS and .NET, I took the approach of writing a straightforward script similar to a Unix shell script.

To execute a PowerShell script, you create a text file that ends with the extension .ps1, e.g. backup_script.ps1. Here’s how to run the script from the Windows command line.

C:\> powershell -file backup_script.ps1

I learned that PowerShell scripts can carry signed security certificates to prevent rogue scripts from causing damage. PowerShell can run in different security modes, ranging from restricted (interactive use only, no scripts) to unrestricted, where you merely get a prompt before running a script from the internet. I settled on “RemoteSigned” mode, where scripts downloaded from the internet must be signed, but ones you create locally need not be. The default mode is restricted, so you’ll need to run the following from the PowerShell prompt to get local scripts to run:

PS C:\> set-executionpolicy remotesigned
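
You can check the current policy at any time with:

PS C:\> get-executionpolicy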

PowerShell supports two things that help the Unix shell scripter feel at home: aliases and pipelines. There are built-in aliases so you can use familiar commands like ls and echo in place of the equivalent PowerShell commands with much longer names. The aliased commands behave differently and take a different set of switches, but it’s still helpful to have the familiar names. The pipeline support lets you pipe virtually all PowerShell commands, unlike the meager pipeline support for Windows BAT commands.
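
For example, to see what a familiar name actually maps to:

PS C:\> get-alias ls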

In PowerShell 2.0 there’s an email command, send-mailmessage, which DBAs will find handy for monitoring and reporting.

The help system is built into the PowerShell command line. Typing

PS C:\> help

gets you a complete list of commands and aliases.
And typing

PS C:\> help specific-command

gets you help on that command. There are also switches on the help command for examples, details and full help, e.g. help specific-command -examples.

What’s radically different about PowerShell from Unix-style shell languages is that it’s object-based. When you pipe one command to another, you are piping objects, not text, and the receiving command can reference properties of the objects passed to it.
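
For example, this pipeline filters processes on a numeric property of each process object and then picks out a few other properties, with no text parsing anywhere:

# where-object tests the WorkingSet property of each process object directly.
get-process | where-object {$_.WorkingSet -gt 100MB} | select-object Name, Id, WorkingSet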

One item not supported in PowerShell: input redirection. This is a mainstay of Unix shell scripting and a handy way to keep your SQL statements embedded in the shell script, using variables from the script directly in the SQL without passing them as positional arguments. By input redirection, I mean the following bit of Unix scripting:

#!/bin/sh
sqlplus /nolog << EOF
connect $user/$pass@$sid 
select * from table
where f1 = $another_var;
exit
EOF
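
The closest workaround I’ve found in PowerShell is a here-string piped into sqlplus. Here’s a rough sketch, untested and using the same made-up variables as the Unix example above:

# An expandable here-string (@"..."@) keeps the SQL inline and still substitutes variables.
$sql = @"
connect $user/$pass@$sid
select * from table
where f1 = $another_var;
exit
"@
$sql | sqlplus /nolog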

Here’s a pared-down sample of the PowerShell script I created. This sample does not use the .NET or object-based features of PowerShell, with the exception of the line that deletes old files.

It illustrates the following:
> Getting command line arguments
> Formatting a date
> String concatenation, done à la Unix shell without a concatenation operator
> Assigning variables
> Launching an external command (nothing special, just run it)
> Sending email
> Deleting files older than a certain date

# Get script input params
$ORACLE_SID=$args[0]
$LOG_DIR=$args[1]
$BIN_DIR=$args[2]

# Assemble other variables
$datetime = get-date -format "MMM.dd.yyyy-HH.m.ss"
$RMAN_TYPE="disk_backup"
$RMAN_LOG = "$LOG_DIR\$ORACLE_SID.$RMAN_TYPE.$datetime.rman.log"

# Start logging to file and then run RMAN command. Echo is an alias for Write-Output.
$newline = "Start Execution of $RMAN_TYPE backup on $ORACLE_SID"
echo $newline | out-file $RMAN_LOG
get-date | out-file $RMAN_LOG -append
rman cmdfile=$BIN_DIR\rman_backup_disk_compressed.rman using $ORACLE_SID | out-file $RMAN_LOG -append

# Send email. Note that cat is an alias for Get-Content.
$body = cat $RMAN_LOG | out-string 
send-mailmessage -to "dba1@firm.com" -from "dba2@firm.com" -body $body -subject "RMAN log output" -smtpserver mailserver

# Remove files older than 8 days.
# Note that ls is an alias for Get-ChildItem and del is an alias for Remove-Item.
# Also, ? is an alias for Where-Object and $_ is a built-in variable 
# to refer to the collection of objects passed from the previous pipeline.
ls $LOG_DIR\$ORACLE_SID.$RMAN_TYPE.* |? {$_.LastWriteTime -lt (get-date).AddDays(-8)} | del
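
To run it, pass the SID and the two directories as positional arguments. The script name here is just an example:

C:\> powershell -file rman_backup.ps1 ORCL C:\oracle\logs C:\oracle\scripts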

Getting Rails Working: Vista, Mongrel, SQLite

For kicks I wanted to get Ruby on Rails working. Here’s the combination of stuff I was using:

  • Windows laptop running Vista.
  • Mongrel web server, which I think comes with the default install of Rails.
  • SQLite database. I didn’t want to put MySQL or anything else on this machine quite yet.

Here are the steps I followed.

  1. Went to http://www.rubyonrails.org. Clicked on the Download tab and followed the steps to download stuff.
  2. First item was ruby, which I installed to C:\ruby. Ruby comes with an installer. It was ruby186-26.exe for the version I downloaded.
  3. Then it was on to gems. You download the file, extract it, then run ruby setup.rb.
  4. I installed Rails. That just involves typing gem install rails --include-dependencies.
  5. Step 4 above might have been enough to install mongrel, a web server that rails can use. However, I ran gem install mongrel just to be sure.
  6. To get sqlite working, I downloaded sqlite from www.sqlite.org. I put sqlite3.exe, sqlite3.dll and sqlite3.def in c:\sqlite. HOWEVER, to get things working with ruby and rails, I copied these three files to c:\ruby\bin. THIS was critical.
  7. Then install the Ruby/SQLite interface. Do that with gem install sqlite3.
  8. At this point I was ready to run Rails. I made a new app called people by doing the following:
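
    C:\> rails people
    C:\> cd people
    C:\> ruby script/server

Then I pointed a browser at http://localhost:3000.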

Yea!!! It worked.