My challenge: automate creating a JSON file that defines a Datameer import job operating on a fixed width text file. Why bother? Current state, the Datameer screen for working on a fixed width format with many columns can be time consuming.
Python and the Datameer REST API to the rescue.
I learned some basic python goodness while doing this.
1. Choose either single quotes or double quotes when assigning a value to a Python string variable. Super useful when the string contains one of those characters inside.
# This is a string. my_var = 'Long value with " inside of it' # And this is a string too. my_other_var = "this has a ' inside of it."
2. You can assign a variable to a multi-line string variable that contains all kinds of single and double quotes simply by enclosing the value with either triple single quotes or triple double quotes. Putting the letter r makes the string “raw”, allowing for other tricky character that would require an escape character before them.
my_var = r'''This string has " in it and is broken out over several lines and even has \ and ' inside of it.'''
3. When appending elements to a JSON array object, the python dict data structure solves tricky quoting issues. Also, note how the values True and False are capitalized. The python dict object needs these boolean values capitalized, but the json append method will convert them to true and false, which is the correct capitalization for JSON.
json_data["fields"].append(dict(id = counter, name =field_name, start = start_pos, origin = origin_str, chars = num_chars, end = end_pos, acceptEmpty = True, include = True, version = 3, valueType = field_type))
4. I used an older python string substitution method to concatenate variable values into a really long string. In the string value, you place the value %s. You can place this multiple times throughout the string. At the end of the string, list out, in order, the variables that will replace the %s markers.
my_var1 = "my_filename" my_var2 = "Employee Module" data = r'''{ "version": "9.0", "className": "com.department.employee", "file": { "path": "%s.exe", "description": "", "name": "%s" } }''' % (my_var1, my_var2)
The logic of the script is straightforward.
OPEN A FILE CONTAINING THE FIELD DEFINITIONS FOR THE FIXED WIDTH FILE
ASSEMBLE A JSON OBJECT CONTAINING ALL THE JSON NEEDED, MINUS THE SPECIFIC FIELD DEFINITIONS. CONCATENATE CUSTOMIZABLE PARTS LIKE PATH AND OBJECT_NAME.
ITERATE THROUGH THE FIELD DEF FILE AND ADD ALL THE FIELDS TO THE FIELD ARRAY IN THE JSON OBJECT
WRITE OUT THE COMPLETE JSON IMPORT JOB DEFINITION TO DISK
RUN THE REST API CALL TO CREATE A NEW IMPORT JOB USING THE DYNAMICALLY ASSEMBLED JSON
Here’s the script.
####################################################################################### # Python 2.7 script to create a JSON definition file for a Datameer import job # that operates on a fixed width data source. ####################################################################################### # Note that the output JSON file is not formatted in the order that Datameer would # order it. Correctly formatted JSON is not dependent on order. Python's JSON # library doesn't maintain the order the JSON was assembled in, but Datameer will # ingest the file correctly. When Datameer renders the JSON created by this script, it # will format it in the usual order. # Needs a field_def file that contains the following # 1. Field name # 2. Start position of field # 3. End position of field # 4. Data type of field # Python libraries needed for this script. import json import csv import sys import requests # Host and credentials datameer_server_url = 'http://localhost:7996/rest/import-job' user = 'admin' password = 'admin' # Check command line arguments. if len(sys.argv) != 8: print 'This script builds a Datameer JSON file for an import job using the fixed-width' print 'format. It takes 7 arguments:' print '1. The name of the object to create' print '2. The file that contains field defs' print '3. The name for the output JSON file' print '4. Include meta data fields? Y or N' print '5. Datameer path where you want to put object' print '6. Full Datameer path and name of Connection object to use' print '7. Source location, can include folders, filename and wild cards' print 'Example: $ python fixed_width_create.py my_name fdefs.csv out.json Y /Data/ImportJobs /Data/Connections/my_connection.dst fixedwidth/*' sys.exit() # Put command line arguments into variables object_name = sys.argv[1] field_def = sys.argv[2] output_filename = sys.argv[3] do_dasMetaFields = sys.argv[4] object_path = sys.argv[5] connection_path_and_file = sys.argv[6] source_path = sys.argv[7] # Initial JSON data that defines the import job. The fields array is empty. # Uses positional string replacement parameter: %s # This is a multi-line string that uses the 3 apostrophe notation. # Also, the letter r precedes the apostrophes to make the string "raw", which handles # all the characters that might otherwise need escape characters. data = r'''{ "version": "5.11.14", "className": "datameer.dap.common.entity.DataSourceConfigurationImpl", "file": { "path": "%s/%s.imp", "description": "", "name": "%s" }, "properties": { "GenericConfigurationImpl.temp-file-store": [ "6e67fa8f-7f24-4d4c-adbb-dddabe70d19a" ], "fileType": [ "FIXED_WIDTH" ], "filter.page.does.split.creation": [ "false" ], "file": [ "%s" ], "detectColumnDefinition": [ "SELECT_PARSE_AUTO" ], "ignore.first.n.lines": [ "0" ], "fileNameTimeRange_mode": [ "OFF" ], "filter.maxAge": [ "" ], "filter.minAge": [ "" ], "characterEncoding": [ "UTF-8" ], "recordSampleSize": [ "1000" ], "ObfuscatedColumns": [ "" ], "TextFileFormat": [ "TEXT" ], "incrementalMode": [ "false" ] }, "hadoopProperties": "", "dataStore": { "path": "%s" }, "errorHandlingMode": "DROP_RECORD", "maxLogErrors": 1000, "maxPreviewRecords": 5000, "notificationAddresses": "", "notificationSuccessAddresses": "", "fields": [ ] }''' % (object_path, object_name, object_name, source_path, connection_path_and_file) # Above line shows the values getting replaced. # Load the string python variable into a JSON object. json_data = json.loads(data) # Print out results to screen for debugging. print(json_data) # Open the list of fields/widths/data types with open(field_def,'r') as field_list: # Now read the variable with the csv library. csv_f = csv.reader(field_list) # Read past the header row. csv_f.next() # Variable to use in creating an ID value for columns in the JSON field array. counter = 0 # Assign column values to variables and get length value. for row in csv_f: #print row[2] counter = counter + 1 field_name = row[0] start_pos = int(row[1]) - 1 end_pos = int(row[2]) num_chars = end_pos - start_pos origin_str = str(start_pos) + "," + str(end_pos) data_type = row[3] field_type = '{\"type\":\"' + data_type + '\"}' # Use the variables to append to the JSON object. json_data["fields"].append(dict(id = counter, name =field_name, start = start_pos, origin = origin_str, chars = num_chars, end = end_pos, acceptEmpty = True, include = True, version = 3, valueType = field_type)) # Check if user wanted the meta fields. if do_dasMetaFields == 'Y': counter += 1 field_name = 'dasFileName' origin_str = 'fileInfo.fileName' field_type = '{\"type\":\"STRING\"}' json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type)) counter += 1 field_name = 'dasFilePath' origin_str = 'fileInfo.filePath' field_type = '{\"type\":\"STRING\"}' json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type)) counter += 1 field_name = 'dasLastModified' origin_str = 'fileInfo.lastModified' field_type = '{\"type\":\"DATE\"}' json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type)) counter += 1 field_name = 'dasExecutionId' origin_str = 'jobexecutionInfo.id' field_type = '{\"type\":\"INTEGER\"}' json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type)) counter += 1 field_name = 'dasJobExecutionStart' origin_str = 'jobexecutionInfo.starttime' field_type = '{\"type\":\"INTEGER\"}' json_data["fields"].append(dict(id = counter, name = field_name, origin = origin_str, acceptEmpty = False, include = True, version = 3, valueType = field_type)) # Write JSON data into a file using variable names from top of script. # Use the method called json.dump() # It's just dump() and not dumps() # Encode JSON data with open(output_filename, 'w') as f: json.dump(json_data, f) # Also, create a new Datameer object within Datameer using the Datameer REST API ret_val = requests.post(datameer_server_url, auth=(user, password), data = json.dumps(json_data))