Dynamically Create JSON

My challenge: automate creating a JSON file that defines a Datameer import job operating on a fixed-width text file. Why bother? As things stand, working through the Datameer screen for a fixed-width format with many columns can be time consuming.

Python and the Datameer REST API to the rescue.

I learned some basic python goodness while doing this.

1. Choose either single quotes or double quotes when assigning a value to a Python string variable. Super useful when the string contains one of those characters inside.

# This is a string.
my_var = 'Long value with " inside of it'
# And this is a string too.
my_other_var = "this has a ' inside of it."

2. You can assign a multi-line string that contains all kinds of single and double quotes to a variable simply by enclosing the value in either triple single quotes or triple double quotes. Prefixing the string with the letter r makes it "raw", which handles other tricky characters that would otherwise require an escape character before them.

my_var = r'''This string has " in it and is
               broken out over several
             lines and even has \ and ' inside of it.'''

3. When appending elements to a JSON array object, the python dict data structure solves tricky quoting issues. Also, note how the values True and False are capitalized. The python dict object needs these boolean values capitalized, but the json library will write them out as true and false, which is the correct capitalization for JSON.

json_data["fields"].append(dict(id = counter, name =field_name, start = start_pos, 
      origin = origin_str, chars = num_chars, end = end_pos, acceptEmpty = True, 
      include = True, version = 3, valueType = field_type))

4. I used an older python string substitution method to concatenate variable values into a really long string. You place the marker %s in the string, as many times as you need. Then, after the closing quote, list out, in order, the variables that will replace the %s markers.

my_var1 = "my_filename"
my_var2 = "Employee Module"
data  = r'''{
 "version": "9.0",
 "className": "com.department.employee",
 "file": {
 "path": "%s.exe",
 "description": "",
 "name": "%s"
 }
}''' % (my_var1, my_var2)

The logic of the script is straightforward.

OPEN A FILE CONTAINING THE FIELD DEFINITIONS FOR THE FIXED WIDTH FILE

ASSEMBLE A JSON OBJECT CONTAINING ALL THE JSON NEEDED, MINUS THE SPECIFIC FIELD DEFINITIONS. CONCATENATE CUSTOMIZABLE PARTS LIKE PATH AND OBJECT_NAME.

ITERATE THROUGH THE FIELD DEF FILE AND ADD ALL THE FIELDS TO THE FIELD ARRAY IN THE JSON OBJECT

WRITE OUT THE COMPLETE JSON IMPORT JOB DEFINITION TO DISK

RUN THE REST API CALL TO CREATE A NEW IMPORT JOB USING THE DYNAMICALLY ASSEMBLED JSON

Here’s the script.

#######################################################################################
# Python 2.7 script to create a JSON definition file for a Datameer import job
# that operates on a fixed width data source.
#######################################################################################
# Note that the output JSON file is not formatted in the order that Datameer would 
# order it. Correctly formatted JSON is not dependent on order. Python's JSON 
# library doesn't maintain the order the JSON was assembled in, but  Datameer will
# ingest the file correctly. When Datameer renders the JSON created by this script, it 
# will format it in the usual order.

# Needs a field_def file that contains the following
# 1. Field name
# 2. Start position of field
# 3. End position of field
# 4. Data type of field

# Python libraries needed for this script.
import json
import csv
import sys
import requests

# Host and credentials
datameer_server_url = 'http://localhost:7996/rest/import-job'
user = 'admin'
password = 'admin'

# Check command line arguments.
if len(sys.argv) != 8:
     print 'This script builds a Datameer JSON file for an import job using the fixed-width'
     print 'format. It takes 7 arguments:'
     print '1. The name of the object to create'
     print '2. The file that contains field defs'
     print '3. The name for the output JSON file'
     print '4. Include meta data fields? Y or N'
     print '5. Datameer path where you want to put object'               
     print '6. Full Datameer path and name of Connection object to use'
     print '7. Source location, can include folders, filename and wild cards'                           
     print 'Example: $ python fixed_width_create.py my_name fdefs.csv out.json Y /Data/ImportJobs /Data/Connections/my_connection.dst fixedwidth/*'
     sys.exit()

# Put command line arguments into variables
object_name = sys.argv[1]
field_def = sys.argv[2]
output_filename = sys.argv[3]
do_dasMetaFields  = sys.argv[4]
object_path = sys.argv[5]
connection_path_and_file = sys.argv[6]
source_path = sys.argv[7]

# Initial JSON data that defines the import job. The fields array is empty.
# Uses positional string replacement parameter: %s 
# This is a multi-line string that uses the 3 apostrophe notation. 
# Also, the letter r precedes the apostrophes to make the string "raw", which handles
# all the characters that might otherwise need escape characters.
data = r'''{
  "version": "5.11.14",
  "className": "datameer.dap.common.entity.DataSourceConfigurationImpl",
  "file": {
    "path": "%s/%s.imp",
    "description": "",
    "name": "%s"
  },
"properties": {
    "GenericConfigurationImpl.temp-file-store": [
      "6e67fa8f-7f24-4d4c-adbb-dddabe70d19a"
    ],
    "fileType": [
      "FIXED_WIDTH"
    ],
    "filter.page.does.split.creation": [
      "false"
    ],
    "file": [
      "%s"
    ],
    "detectColumnDefinition": [
      "SELECT_PARSE_AUTO"
    ],
    "ignore.first.n.lines": [
      "0"
    ],
    "fileNameTimeRange_mode": [
      "OFF"
    ],
    "filter.maxAge": [
      ""
    ],
    "filter.minAge": [
      ""
    ],
    "characterEncoding": [
      "UTF-8"
    ],
    "recordSampleSize": [
      "1000"
    ],
    "ObfuscatedColumns": [
      ""
    ],
    "TextFileFormat": [
      "TEXT"
    ],
    "incrementalMode": [
      "false"
    ]
  },
  "hadoopProperties": "",
  "dataStore": {
    "path": "%s"
  },
  "errorHandlingMode": "DROP_RECORD",
  "maxLogErrors": 1000,
  "maxPreviewRecords": 5000,
  "notificationAddresses": "",
  "notificationSuccessAddresses": "",
  "fields": [
  ]
}''' % (object_path, object_name, object_name, source_path, connection_path_and_file)

# Above line shows the values getting replaced.
 
# Load the string python variable into a JSON object. 
json_data = json.loads(data)

# Print out results to screen for debugging.
print(json_data)

# Open the list of fields/widths/data types
with open(field_def,'r') as field_list:

     # Now read the variable with the csv library.
     csv_f = csv.reader(field_list)

     # Read past the header row.
     csv_f.next()

     # Variable to use in creating an ID value for columns in the JSON field array.
     counter = 0

     # Assign column values to variables and get length value.
     for row in csv_f:
          #print row[2]
          counter = counter + 1
          field_name = row[0] 
          start_pos = int(row[1]) - 1
          end_pos = int(row[2])
          num_chars = end_pos - start_pos
          origin_str = str(start_pos) + "," + str(end_pos)
          data_type = row[3]
          field_type = '{\"type\":\"' + data_type + '\"}'

          # Use the variables to append to the JSON object.
          json_data["fields"].append(dict(id = counter, name =field_name, start = start_pos, origin = origin_str, chars = num_chars, end = end_pos, acceptEmpty = True, include = True, version = 3, valueType = field_type))

     # Check if user wanted the meta fields.
     if do_dasMetaFields == 'Y': 
  
          counter += 1
          field_name = 'dasFileName'
          origin_str = 'fileInfo.fileName'
          field_type = '{\"type\":\"STRING\"}'
          json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type))

          counter += 1
          field_name = 'dasFilePath'
          origin_str = 'fileInfo.filePath'
          field_type = '{\"type\":\"STRING\"}'
          json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type))

          counter += 1
          field_name = 'dasLastModified'
          origin_str = 'fileInfo.lastModified'
          field_type = '{\"type\":\"DATE\"}'
          json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type))

          counter += 1
          field_name = 'dasExecutionId'
          origin_str = 'jobexecutionInfo.id'
          field_type = '{\"type\":\"INTEGER\"}'
          json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type))
 
          counter += 1
          field_name = 'dasJobExecutionStart'
          origin_str = 'jobexecutionInfo.starttime'
          field_type = '{\"type\":\"INTEGER\"}'
          json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type))
 									
# Write JSON data into a file using variable names from top of script.
# Use the method called json.dump()
# It's just dump() and not dumps()
# Encode JSON data
with open(output_filename, 'w') as f:
     json.dump(json_data, f)

# Also, create a new Datameer object within Datameer using the Datameer REST API    
ret_val = requests.post(datameer_server_url, auth=(user, password), data = json.dumps(json_data))     

Automate running the same Datameer Workbook on different data sets

Datameer is an extremely user-friendly software platform for self-service, end-to-end big data discovery and analytics. You can do just about anything in the realm of descriptive analytics using the Datameer GUI, and you can automate a complex series of steps just by pointing and clicking with software that is as easy to use as Microsoft Office.

But there are some activities that require using the Datameer REST API. A prospect recently asked me “How can I run the same Datameer workbook on 50 different data sets that share the same column layout without doing the same thing 50 times manually using the GUI?” The answer is creating a script that invokes the Datameer REST API. The beauty of a REST API is that you can use the scripting language of your choice to tackle the problem. I ended up creating sample scripts in both Bash and python to illustrate that you can pick your language.

The first order of business is defining the problem. In this case, we needed a mechanism to iterate over a set of data sources, all of which share the same column layout. During each iteration, the script needs to create a new data source, update the workbook in question to use that data source, and then execute the newly updated workbook. Further, I chose to implement a mechanism to check that the workbook had completed before starting the next one, rather than moving on to the next cycle without waiting for the workbook job to finish. The data source I used in Datameer is called a Data Link. Data Links extract a sample data set from the source without duplicating the entire source set. When you want not only a sample but the entire data set available in Datameer's HDFS folder, use an Import Job instead.

Pseudo code for my example looks like this:

WHILE LOOPING THROUGH LIST OF DATA SOURCES

CREATE A NEW DATA LINK for the CURRENT DATA SOURCE

UPDATE THE WORKBOOK DEFINITION TO USE THE NEW DATA LINK

RUN THE WORKBOOK

CHECK THAT THE WORKBOOK IS DONE

I ended up creating two functions for steps that got repeated a lot, which were 1) updating JSON files that contain the definitions of Data Links and Workbooks, and 2) checking on job status.

Notes on doing a REST API call:
In bash: I used the curl command to do the REST call. I also used backticks to place the output of the curl command into a variable. The line of code looks like this:

# Note the use of variables to store username, password, host, port, etc.
# Use of the correct quotes is important.
RET_VAL=`curl -u $USER:$PASSWORD -X GET "http://$HOST:$PORT/rest/job-configuration/job-status/$CONFIGURATION_ID"`
# Unfortunately, I chose a pretty ugly method to parse the variable RET_VAL. A combo of grep, awk and tr.
STATUS=`echo "$RET_VAL" | grep "jobStatus" | awk -F: '{print $2}' | tr -d '[:space:]' | tr -d \"`
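If the jq utility happens to be installed on the host, the parsing gets much cleaner. This is just an alternative sketch, not part of the original script:

# Alternative: let jq pull the jobStatus field straight out of the JSON response.
STATUS=`echo "$RET_VAL" | jq -r '.jobStatus'`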

In python: I used a python module called requests to invoke Datameer’s REST API.

# The python implementation is cleaner. Run the request, using the auth parameter and the HTTPBasicAuth function to handle authentication.
ret_val = requests.get("http://" + host + ":" + port + "/rest/job-configuration/job-status/" + configuration_id, auth=HTTPBasicAuth(user, password))
# Then convert the return value to JSON.
params = ret_val.json()
# And finally pull out the parameter that you want.
status = params['jobStatus']

Below you can find the source code in both Bash and python.

The Bash example.


#!/bin/bash

# Example script for creating several datalinks on similar tables and then creating
# workbooks to run on those data links.

# Script assumes you have one working data link and one working workbook, and have
# modified the underlying JSON to include some tags that sed can look for and replace.

# Datameer environment parameters
USER=your_user
PASSWORD=your_password
HOST=localhost
PORT=7996

# Script parameters
# Parameter 1: DATALINK_TEMPLATE_FILE - Contains datalink JSON to be modified. Script does not alter
#              this file. Instead, it creates a temporary file and runs that.
# Parameter 2: WORKBOOK_TEMPLATE_FILE - Contains workbook JSON to be modified. Script does not alter
#              this file. Instead, it creates a temporary file and runs that.
# Parameter 3: LIST_OF_TABLES - Contains a list of the target tables to be processed.

# Assign parameters to global variables.
DATALINK_TEMPLATE_FILE=$1
WORKBOOK_TEMPLATE_FILE=$2
LIST_OF_TABLES=$3

# Temporary file that contains edits when they happen.
TEMP_FILE=zzz_temp_file.json

# Sub Routine: do_change. Runs sed utility to do update.
# Positional arguments
# 1. Value to look for.
# 2. New value to put in.
do_change () {
# Get positional arguments for function.
CUR_VAL=$1
NEW_VAL=$2

	echo "Updating value $CUR_VAL with $NEW_VAL..."
	sed -i.bak -e "s/$CUR_VAL/$NEW_VAL/g" $TEMP_FILE
}

# Sub Routine: check_job_status. Stay in loop until job completes.
# Positional arguments
# 1. Job ID.
check_job_status () {
# Get positional arguments for function.
CONFIGURATION_ID=$1

	echo "Checking job status for $JOB_ID"
	echo "'http://localhost:7996/rest/job-configuration/job-status/$JOB_ID'"

    # Run loop until status shows job is done.
    while [[ 1 == 1 ]]; do
	    RET_VAL=`curl -u $USER:$PASSWORD -X GET "http://$HOST:$PORT/rest/job-configuration/job-status/$CONFIGURATION_ID"`
    	STATUS=`echo "$RET_VAL" | grep "jobStatus" | awk -F: '{print $2}' | tr -d '[:space:]' | tr -d \"`
    	echo $STATUS

    	sleep .5

    	if [[ $STATUS = "COMPLETED" || $STATUS = "ERROR" || $STATUS = "COMPLETED_WITH_WARNINGS"  || $STATUS = "ABORTING" || $STATUS = "_NONE" ]]
    	then
    		break
    	fi
    done
}

# Main loop for running through all tables in the list.
while read table_name; do

	# Edit Datalink JSON
	cp $DATALINK_TEMPLATE_FILE $TEMP_FILE
	do_change "
<TABLE>" $table_name $DATALINK_TEMPLATE_FILE
	do_change "" $table_name $DATALINK_TEMPLATE_FILE

	# Issue command to create new Datalink. Capture result in variable.
    RET_VAL=`curl -u $USER:$PASSWORD -X POST -d @$TEMP_FILE "http://${HOST}:${PORT}/rest/import-job"`
	# Display result to screen.
    echo "$RET_VAL"
    # Parse result to get configuration id.
    CONFIGURATION_ID=`echo "$RET_VAL" | grep "configuration" | awk -F: '{print $2}' | tr -d '[:space:]'`

	#Display result to screen.
	echo $CONFIGURATION_ID

	# Now run the Datalink to get sample data.
	curl -u $USER:$PASSWORD -X POST "http://${HOST}:${PORT}/rest/job-execution?configuration=${CONFIGURATION_ID}"	

	# Pause until the data link finishes.
	check_job_status $CONFIGURATION_ID	

	# Edit Workbook JSON to include the new datalink
	cp $WORKBOOK_TEMPLATE_FILE $TEMP_FILE
	do_change "" $table_name $WORKBOOK_TEMPLATE_FILE	

	# Issue command to create new workbook
    RET_VAL=`curl -u $USER:$PASSWORD -X POST -d @$TEMP_FILE "http://$HOST:$PORT/rest/workbook"`
	# Display result to screen.
    echo "$RET_VAL"
    # Parse result to get configuration id.
    CONFIGURATION_ID=`echo "$RET_VAL" | grep "configuration" | awk -F: '{print $2}' | tr -d '[:space:]'`

	# Now run the new workbook.
	#Display result to screen.
	echo $CONFIGURATION_ID
	# Now run the Workbook to get sample data.
	curl -u $USER:$PASSWORD -X POST "http://${HOST}:${PORT}/rest/job-execution?configuration=${CONFIGURATION_ID}"

	# Pause until the workbook finishes.
	check_job_status $CONFIGURATION_ID	

done < $LIST_OF_TABLES

And the python example.

# Example script for creating several datalinks on similar tables and then creating
# workbooks to run on those data links.

# Script assumes you have one working data link and one working workbook, and have
# modified the underlying JSON to include some tags that the script can look for and replace.

import sys
import shutil
import requests
from requests.auth import HTTPBasicAuth
import json
import time

# Datameer environment parameters
user = "your_user"
password = "your_password"
host = "localhost"
port = "7996"

# Get input parameters
datalink_template_file = sys.argv[1]
workbook_template_file = sys.argv[2]
list_of_tables = sys.argv[3]

# Temp file to use for creating dynamic JSON to create Datameer objects.
temp_file = "./zzz_temp_file.json"
temp_file2 = "./zzz_temp_file.2.json"

# Function for updating JSON file contents
# Open one file for read, another for write. Make changes. At end, swap files.
def do_change(cur_val, new_val, file_name, file_name2):
    with open(file_name, "rt") as f_in:
        with open(file_name2, "wt") as f_out:
            for line in f_in:
                f_out.write(line.replace(cur_val, new_val))
    shutil.copy(file_name2, file_name)

def check_job_status(configuration_id):
    print ("Checking job status for " + configuration_id)
    print ("'http://localhost:7996/rest/job-configuration/job-status/" + configuration_id + "'")

    # Run loop until status shows job is done.
    while 1 == 1:
        ret_val = requests.get("http://" + host + ":" + port + "/rest/job-configuration/job-status/" + configuration_id, auth=HTTPBasicAuth(user, password))
        params = ret_val.json()
        print (params)
        status = params['jobStatus']
        print (status)
        time.sleep(1)

        # Check status and break out of the loop when the job is done.
        if (status == "COMPLETED") or (status == "ERROR") or (status == "COMPLETED_WITH_WARNINGS") or (status == "ABORTING") or (status == "_NONE"):
            break

# Main loop iterating through list of tables.
with open(list_of_tables,'r') as table_file:

    for table_name in table_file:

        table_name = table_name.rstrip()
        if not table_name: continue

        print ("Copying " + datalink_template_file + " for " + table_name)
        # Copy the template file to a temporary working file.
        shutil.copy(datalink_template_file, temp_file)

        do_change ("<TABLE_NAME>", table_name, temp_file, temp_file2)
        do_change ("<LINK_NAME>", table_name, temp_file, temp_file2)

        with open(temp_file, 'r') as content_file:
            payload = content_file.read()

        # Add new datalink to Datameer.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/import-job", auth=HTTPBasicAuth(user, password), data=payload)
        print (ret_val.json())
        params = ret_val.json()
        configuration_id = str(params['configuration-id'])
        print (configuration_id)

        # Now start the datalink to get the sample data.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/job-execution?configuration=" + configuration_id, auth=HTTPBasicAuth(user, password))

        # Pause until the data link finishes.
        check_job_status (configuration_id)

        # Edit Workbook JSON to include the new datalink
        shutil.copy(workbook_template_file, temp_file)
        do_change ("<LINK_NAME>", table_name, temp_file, temp_file2)
        with open(temp_file, 'r') as content_file:
            payload = content_file.read()

        # Issue command to create new workbook
        ret_val = requests.post("http://" + host + ":" + port + "/rest/workbook", auth=HTTPBasicAuth(user, password), data=payload)        

        # Display result to screen.
        print(ret_val)
        # Parse result to get configuration id.
        params = ret_val.json()
        configuration_id = str(params['configuration-id'])
        print (configuration_id)	

        # Now run the Workbook to do calculation.
        ret_val = requests.post("http://" + host + ":" + port + "/rest/job-execution?configuration=" + configuration_id, auth=HTTPBasicAuth(user, password))

        # Pause until the workbook finishes.
        check_job_status (configuration_id)


Rebuilding Yum Database

When starting Hadoop components on the Linux virtual machines on my laptop, I periodically get failures at startup. The culprit? A corrupt yum database. I’ve been running Hortonworks 2.0 recently and have used the Ambari GUI to start services. The failure I encounter looks like this:

notice: Finished catalog run in 0.65 seconds
err: /Stage[1]/Hdp/Hdp::Lzo::Package[64]/Hdp::Package[lzo 64]/Hdp::Package::Process_pkg[lzo 64]/Package[lzo]/ensure: change from absent to present failed: Execution of '/usr/bin/yum -d 0 -e 0 -y install lzo' returned 1: rpmdb: Thread/process 3570/140380093466368 failed: Thread died in Berkeley DB library
error: db3 error(-30974) from dbenv->failchk: DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db3 -  (-30974)
error: cannot open Packages database in /var/lib/rpm
CRITICAL:yum.main:

Error: rpmdb open failed

The following series of commands straightens out the situation I typically encounter.

# rm -f /var/lib/rpm/__db*
# db_verify /var/lib/rpm/Packages
# rpm --rebuilddb
# yum clean all

The command yum clean all will not work until the first three commands are run, if you are hitting the DB_RUNRECOVERY error shown above.
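For convenience, the same sequence can be chained so that each later command runs only if the previous one succeeded. These are the same commands as above, just strung together (run as root):

# rm -f /var/lib/rpm/__db* && db_verify /var/lib/rpm/Packages && rpm --rebuilddb && yum clean all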

vmware fusion fixed ip stuff

As a presales engineer at a small big data company with our own database that runs on many platforms, I now use VMWare Fusion all the time.

Here’s some of the grunt work I do when creating new VMs for which I want a fixed IP address and a changed hostname.

If I make a new VM and it's Linux, I may use the VMware rapid-installation feature. Then I want to change the root password, which just requires a restart. During the restart, hit any key to bring up the Linux GRUB screen. Then hit "e" to edit the start parameters. Use the down arrow to get to the line starting with "kernel" and hit "e" again to edit that line. Delete the last two words, "rhgb quiet", and replace them with "1". Then hit Enter to preserve your change and "b" to boot. You will start in single-user mode as root. From there, you can use the passwd command to set a new password for the root user.

Then, to change the hostname, I use these three steps on CentOS Linux, after becoming root of course.
1. Use the hostname command to change the host name.
[root@localhost ~]# hostname yournewhostname.local

2. Edit /etc/sysconfig/network to include the new hostname.
NETWORKING=yes
HOSTNAME=yournewhostname.local

3. Back on my Mac, I figure out what the static IP address will be. With VMWare Fusion 5 (and apparently 4 or higher) this involves editing the following file:
/Library/Preferences/VMware Fusion/vmnet8/dhcpd.conf

Here’s a great write up on what to do.
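For illustration, the kind of host entry that ends up in that dhcpd.conf file looks roughly like this. The MAC address, IP address and hostname below are placeholders, not values from my setup:

host yournewhostname {
    hardware ethernet 00:0C:29:12:34:56;   # the VM's MAC address, from its network settings
    fixed-address 192.168.100.10;          # the static IP you picked for the VM
}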

4. Edit /etc/hosts to include the new hostname. Also include the static ip you chose in step 3.
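The /etc/hosts entry is a single line; placeholder values again:

# static IP chosen in step 3, followed by the new hostname
192.168.100.10   yournewhostname.local   yournewhostname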

5. I've been using CentOS for my VMs, and I use the GUI "Network Connections" tool under the System - Preferences menu to create the static IP address. On the first screen, select your ethernet card; mine shows up as eth0 on the CentOS Linux VMs I've created. Then click the Edit button.

On the resulting screen, I click on the IPv4 tab.

Then I switch the method to Manual. I click Add and make the address the new static IP that I chose in step 3 above. The netmask is 255.255.255.0. The gateway is the IP address of your Mac, and I also make the DNS server the IP address of my Mac.
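For reference, the equivalent settings end up in the interface config file that the GUI tool writes. Here is a rough sketch with placeholder values; substitute the static IP, gateway and DNS you chose above:

# /etc/sysconfig/network-scripts/ifcfg-eth0 (placeholder values)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.100.10
NETMASK=255.255.255.0
GATEWAY=192.168.100.2
DNS1=192.168.100.2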

Expire and Restore with RainStor

Tough Business Problems

A couple of business problems plague companies that maintain large amounts of structured database data.

The first is how to expire older records from systems once they are no longer needed for compliance reasons. This becomes difficult because running DELETE statements via SQL can incur performance problems, especially on very large tables. Another issue is being able to verify that deletes actually took place, and that they took place at the desired time.

The second is how to easily and inexpensively fulfill requests from auditors who ask, "I'd like to see the exact state of your database X number of years ago." The larger the database, the more difficult this process becomes. To restore from a database backup, users need to supply sufficient disk space for the newly restored system. They also need to supply the exact database version in use in the past, as well as the exact operating system.

The RainStor Advantage

RainStor's database is tailored for Big Data amounts of structured and semi-structured data. RainStor routinely compresses data at levels of 20x to 40x. And it has outstanding features that address the two business problems mentioned at the start of this piece. RainStor is a database oriented toward archival and data warehouse use cases. It is an append-only database into which users bulk load data; there are no SQL INSERT or UPDATE statements, for example. However, RainStor ingests bulk-loaded data extremely rapidly, compresses it enormously, and scales out to hundreds of nodes if needed. Its licensing model makes for very attractive pricing compared to other database vendors. And when users put their data in RainStor, they get tremendous control over when data is expired and how they present older states of the database.

Record Retention and Expiration with RainStor

Let’s first talk about expiring older records within RainStor.

A RainStor database system is divided up into archives, schemas and tables. An archive is similar to a database within Sybase, SQL Server or DB2. An archive can contain many schemas. And each schema may own many tables.

Within RainStor, users can set a retention period governing how many days or months records within a table may be kept. The retention period may be set at the archive or table level. In fact, it can also be set at a system-wide level. Archive-level retention settings override the system level, and table-level retention settings override the archive ones.

LOGICAL DATE: The Key

How does RainStor know how old a record is? It checks the LOGICAL DATE set for the record. The LOGICAL DATE is set automatically when a group of records gets ingested into RainStor. If the LOGICAL DATE is not set explicitly by the user, it will simply match the wall-clock date. To set the value explicitly, all one has to do is alter the name of the file to be loaded: instead of naming a file my.table.bcp, just name it my.table.2013-10-17.bcp. RainStor takes care of the rest from there. In fact, in RainStor, one does not even have to run a utility to get data loaded, but that's a story for another day.
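So, for example, tagging a load file with a logical date of October 17, 2013 is nothing more than a rename before loading. The file names here just follow the pattern described above and are only illustrative:

mv my.table.bcp my.table.2013-10-17.bcp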

Now to finish up how records get expired: RainStor simply compares the logical date of records against the retention period that was set. If the records are older than the retention period, they are removed from the system.

The best part is that the expiration is super fast. It is not a SQL DELETE statement. Rather, in RainStor, each data file for a table contains records that share the same LOGICAL DATE, so expiring records means simply removing a file at the operating system level. No SQL DELETE. Super fast.

All That and Easy Reporting Too

Lastly, all record expirations are tracked within RainStor in internal system tables that are easy to report on. Users can furnish auditors with timely reports that prove outdated data gets removed from RainStor in a timely fashion.

RainStor and Database Backup and Recovery

A common use case for deploying RainStor is to eliminate or vastly reduce dependency on traditional database backup. In some RainStor deployments, users replace an existing database system with RainStor. In other cases, they feed RainStor with data from an online transaction processing (OLTP) database system. In either case, RainStor can help eliminate or reduce backups.

LOGICAL DATE Again

Earlier, this article discussed how LOGICAL DATE makes it possible to expire database records. LOGICAL DATE is also the linchpin for running queries that show the state of the database in the past.

In order to see only the rows that were present in the past, users need only run a single line of code.

SET ARCHIVE TIME 2011-11-01;

Doing so will mean that any subsequent query will not return records that have LOGICAL DATES more recent than November 1, 2011.

The SET ARCHIVE TIME functionality is available both in RainStor’s command line query tool and also in RainStor’s JDBC and ODBC drivers.

But What About Schema Changes?

RainStor also takes this concept a step further in a way that is unique on the market today.

Whenever users make a DDL change to add/drop/modify columns or add/drop/modify tables, RainStor also attaches a LOGICAL DATE value to the DDL change. This gives RainStor the amazing ability to run point-in-time queries in the past that not only return just the records from that time but also the combination of columns and tables from that time.

The syntax for doing this? It’s as easy as:

SET ARCHIVE TIME AND SCHEMA 2011-11-01;

Wrapping Up

In summary, RainStor is a uniquely valuable database product. It makes possible enormous reductions in the cost of database storage. It enables companies to easily enforce data retention policies so that businesses can meet data retention SLAs. And it dramatically reduces the cost and complexity of database backup and recovery infrastructure.

Privileges Needed to Backup Databases

Creating backups is one of the most vital tasks in maintaining computer infrastructure. System administrators, and Database Administrators in particular, get saddled with this unglamorous task. Backing up an operating system (OS) generally does not require any special manipulation of the OS prior to sending OS files to a backup target such as tape, disk or deduplicated disk targets like Data Domain. A system administrator can do things the old-fashioned way, such as generating a tar file and sending it to the backup target, or he or she can deploy sophisticated backup software such as EMC's Avamar to administrate and automate the backup process across hundreds or thousands of computers.

In any event, backing up the OS generally requires the root or Administrator level privilege. And this is rarely a problem because the system administration team runs the backups, and those users typically have that privilege anyway.

Backing up databases presents a variety of challenges. One is that for many database systems the Database Administrator (DBA) must place the database in a special mode prior to creating the backup. Another is that on many popular database systems, the DBA must back up the files that store the record of transactions, since these are used to permit recovery to any point in time. Such files are known as transaction logs on SQL Server, Sybase, Postgres and DB2. They are called archive logs in Oracle, although older Oracle documents called them offline redo logs.

And another challenge is that the privilege needed to back up a database varies among database vendors. If the DBA team is running the backups, the level of privilege needed within the database is not a problem. But if the database backups are run by another team, then the DBA team may be reluctant to give out a high level of privilege just to run backups. What to do? To date, there is not much flexibility in the privilege needed to do a database backup.

Here's a breakdown, by database platform, of the backup privileges needed by an internal database user running the backup, along with other considerations.

ORACLE
Privileges needed: The database user executing the RMAN backup needs the SYSDBA role. The users SYS and SYSTEM have this role. Or a new user can be created, but it must have SYSDBA.
Other considerations: If using an RMAN catalog, the database user that connects to the RMAN catalog does NOT have to have SYSDBA. It does need the RECOVERY_CATALOG_OWNER role. In other words, one user can run the backup and another user can record the activity in the RMAN catalog. And what about Oracle 12c? It's possible that the privilege needed to back up the new pluggable database model will no longer require SYSDBA. I'll update this post when 12c gets released.

SQL SERVER
Privileges needed: For an individual database, a SQL Server database user with the database-level role db_backupoperator can perform a backup of that database. NOTE that this privilege must be granted within each new database that gets added to the instance. The server-level role sysadmin will allow a user to back up any database, even new ones.
Other considerations: While db_backupoperator is much lower privileged than sysadmin, only sysadmin ensures that new databases will get backed up. Using db_backupoperator requires explicitly granting the privilege on each new database.

SYBASE ASE
Privileges needed: The lowest level of privilege needed is the Operator role. Operator will permit backup and restore on any database in the instance. The System Administrator (SA) privilege also provides this, but it is the highest-level role. Finally, the database owner can do backup and restore, but this privilege must be granted individually for each database.
Other considerations: The Operator role will work with new databases, so it is a good choice.

DB2
Privileges needed: You must have SYSADM, SYSCTRL, or SYSMAINT authority to use the backup and recovery utilities in DB2. SYSADM is the most powerful of the three. SYSCTRL controls operations that affect system resources but cannot update table data. SYSMAINT can perform maintenance operations on all individual databases within an instance but, like SYSCTRL, cannot update table data.
Other considerations: SYSCTRL and SYSMAINT both offer a lower-privileged account than SYSADM.

NOTE: I plan to add to the breakdown above as I get more information on other databases.

Creating an RMAN Recovery Catalog

Need to create an RMAN recovery catalog? Check the Creating a Recovery Catalog section of the Backup and Recovery User's Guide.

Steps listed there are:
1. Create a tablespace for the catalog
2. Create a user that will own the schema and assign to it the newly created tablespace.
3. Grant the RECOVERY_CATALOG_OWNER privilege to the user, and if 11g or higher, grant CONNECT as well.

SQL> create tablespace TOOLS datafile 'C:\ORADATA\RCATDB\TOOLS_01.DBF' size 100M;

Tablespace created.

SQL> create user RCAT identified by RCAT default tablespace TOOLS temporary tablespace TEMP quota unlimited on TOOLS;

User created.

SQL> grant recovery_catalog_owner to RCAT;

Grant succeeded.

SQL> grant connect to RCAT;

Grant succeeded.

4. Using the RMAN utility, log in with the CONNECT CATALOG syntax as the recovery catalog schema owner and run CREATE CATALOG. This will build the schema. You can specify a tablespace if desired.


C:\Users\emcadmin> rman

Recovery Manager: Release 11.2.0.1.0 - Production on Wed Jan 2 18:30:33 2013

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

RMAN> connect catalog RCAT/RCAT

connected to recovery catalog database

RMAN> create catalog tablespace tools;

recovery catalog created

At this point, one can begin registering, within the catalog, the databases to be backed up.


RMAN> connect target;

RMAN> register database; 

SQL for checking RMAN catalog or control file for backup information

Do you need a quick query to check the local control file to see if backups succeeded? Use the below if there is no RMAN catalog involved.

/*
Query to use when storing metadata in the control file.
SID column will be 0 unless working with Oracle RAC.
*/
select sid, object_type, status, 
round((end_time - start_time) * 24 * 60, 2) duration_minutes,  
to_char(start_time, 'mm/dd/yyyy hh:mi:ss') start_time, 
to_char(end_time, 'mm/dd/yyyy hh:mi:ss') end_time,
round((input_bytes/(1024*1024*1024)),2) input_gb, 
round((output_bytes/(1024*1024*1024)),2) output_gb
from v$rman_status
where operation = 'BACKUP';

If you are using the RMAN catalog, then you can run this query instead.

/*
Query to use when storing metadata in the RMAN catalog.
*/
select db_name, object_type, status, 
round((end_time - start_time) * 24 * 60, 2) duration_minutes,  
to_char(start_time, 'mm/dd/yyyy hh:mi:ss') start_time, 
to_char(end_time, 'mm/dd/yyyy hh:mi:ss') end_time,
round((input_bytes/(1024*1024*1024)),2) input_gb, 
round((output_bytes/(1024*1024*1024)),2) output_gb
from rc_rman_status
where operation = 'BACKUP';
/*
ADD THE FOLLOWING TO LIMIT THE ROWS RETURNED TO THE LAST 31 DAYS
and end_time > (SYSDATE - 31)
*/

quick peek at postgres post install on ubuntu

I’m checking out how postgres works.

I installed it on 32-bit Ubuntu 12.04. Googling led to multiple places with the appropriate apt-get commands.

The install created a user, postgres, that runs the postgres binaries. However, root owns the binary files.

A default instance got created. I haven’t learned yet if people often run multiple postgres instances per host.

The following processes run with the default installation:
/usr/lib/postgresql/9.1/bin/postgres
postgres: writer process
postgres: WAL process
postgres: autovacuum launcher
postgres: stats collector process

The first process is the main postgres process, and it was launched with the -D parameter pointing to a specific data directory and the -c parameter pointing to the full path of the postgresql.conf file.

The writer process, I'm surmising, must write to data files, and the WAL process, I've read elsewhere, handles the Write-Ahead Log, similar to the redo log writer in Oracle. The autovacuum launcher governs the ability to automatically run the VACUUM command, which Postgres needs periodically. And I'm fairly sure the stats collector updates query optimization stats, but I'll have to check.

There's a command, psql, that is the equivalent of sqlplus. I'll explore psql in a follow-up post.
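A couple of quick ways to poke at it from the shell, assuming the default Ubuntu setup where the postgres OS user is the database superuser; these commands are only a sketch:

sudo -u postgres psql -c "SELECT version();"   # run a one-off query
sudo -u postgres psql -l                       # list the databases in the instance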

Documentation for postgres can be found at postgresql.org.

Having worked as a SQL Server and Oracle DBA, I know that keeping track of database storage is important. The documentation for those two products describes early on how each system places all objects into datafiles. A datafile can contain tables, indexes, stored procedures, views and everything else.

Postgres, on the other hand, relegates discussion of physical storage to a spot fairly deep in the documentation. Each table gets its own datafile. A master directory tree contains all the objects in the postgres database, with most objects getting their own separate file. And postgres dictates the directory structure, although perhaps in more advanced deployments users can control some aspects. The filenames contain a number that is automatically generated by postgres. My instance installed to /var/lib/postgresql/9.1/main, and there are multiple sub-directories below that.
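For a quick look at that directory tree on an install like mine (paths from above; the base sub-directory holds one sub-directory per database, and I believe the numbered relation files live in there):

sudo ls /var/lib/postgresql/9.1/main
sudo ls /var/lib/postgresql/9.1/main/base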

Done writing for now, but I'm going to create some tables, bang around with psql, and try out the GUI admin tool pgAdmin III.

learning postgres

I work at EMC in the Backup and Recovery Services (BRS) division, and we use postgres. It powers our backup software catalog for Avamar. We use it as a database repository for Data Protection Advisor (DPA). And it was the first database to be virtualized automatically by VMware’s vFabric Data Director.

In the Big Data landscape, postgres pops up all the time. Greenplum uses it. Netezza uses it. Hadapt, a newcomer to the big data space uses it. I think maybe Platfora uses it but by this point my head is spinning and I can’t even remember where I read that. And Cloudera uses it to store management data.

And probably a bazillion other pieces of software use it.

I’m interested in EMC and in big data so I’m going to start learning postgres.

I’ll finish up with the “What about mysql?” question. In general, I’d always read that mysql is easy to learn, fast by default and deployed widely for small web apps. And that postgres is slower, but more reliable and feature rich. Some recent browsing reminded me that MySQL has corporate backing, first from its original corporate owners, then from Sun and now from Oracle, which currently owns it. And Postgres remains 100% open source. Finally, mysql allows users to pick their storage engine while Postgres provides for just one.

mysql vs postgres links for reference:
One on wikivs.com
A stackoverflow question with responses
An ancient databasejournal.com article still getting traffic
A blog posting by Chris Travers
