Monday, December 3, 2012

Automation of raster/grid data processing using ArcGIS and Python

key words: automation, processing, python, ArcGIS, arcpy, xml, minidom

The main idea of this post is to show a way to process datasets that are stored as files and organized in multiple folders. To be more specific, I will describe a particular case.

Task: Create GeoTiff files from delivered ASCII files (x,y,z). The ASCII files are organised in folders: the root folder is “grid_data” and there are several subfolders grouping the data according to resolution (5m grid, 2m grid). There is no coordinate system definition attached to the files. The coordinate system definition is delivered in a separate PDF document, as is the information about the grid resolution.

Before processing you always have to do some preparation, especially if you want to make your processing autonomous and not dependent on an operator. In this case I’ve decided to create an xml file (info.xml) that holds some basic information about the delivered data as well as the desired parameters for processing. Here is a sample file (the values shown are just examples).
<?xml version="1.0" encoding="UTF-8"?>
<info>
  <input_data>
    <res>5</res> <!-- dataset resolution in m -->
    <srs>wgs84.prj</srs> <!-- name of ESRI prj file -->
  </input_data>
  <results>
    <root_folder></root_folder> <!-- Root folder for processed files. Leave empty if you want to use same folder as input data -->
    <srs></srs> <!-- name of ESRI prj file. Leave empty if you want to use same srs as input data -->
  </results>
</info>
I have decided to use xml for this purpose because it is more flexible when you need to make changes (e.g. if you want to apply a similar approach to some other task). To parse this xml file I have created the following python class:
'''
Created on 23. nov. 2012

@author: Stipica Pavicic
'''
from xml.dom.minidom import parseString

#configuration sample file m:\python\info.xml

class ConfigObj(object):
    '''
    Parsing of xml file that holds configuration parameters for processing.
    Parameters are separated in two sections <input_data> and <results>.
    '''
    class InputObj(object):
        '''parsing of <input_data> segment of info.xml file'''
        _res = 'not_set'
        _srs = 'not_set'
        def __init__(self, dom_obj):
            node = dom_obj.getElementsByTagName('input_data')[0]
            dom_elem = node.getElementsByTagName('srs')[0].firstChild
            if dom_elem:
                self._srs = dom_elem.data

            dom_elem = node.getElementsByTagName('res')[0].firstChild
            if dom_elem:
                self._res = dom_elem.data
        def getSrs(self):
            return self._srs
        def getRes(self):
            return self._res

    class ResultObj(object):
        '''parsing of <results> segment of info.xml file'''
        _srs = 'not_set'
        _root_folder = 'not_set'
        def __init__(self, dom_obj):
            node = dom_obj.getElementsByTagName('results')[0]
            dom_elem = node.getElementsByTagName('srs')[0].firstChild
            if dom_elem:
                self._srs = dom_elem.data
            else:
                # in case the result srs is not defined in the xml file use the input data srs
                nodei = dom_obj.getElementsByTagName('input_data')[0]
                dom_elemi = nodei.getElementsByTagName('srs')[0].firstChild
                self._srs = dom_elemi.data
            dom_elem = node.getElementsByTagName('root_folder')[0].firstChild
            if dom_elem:
                self._root_folder = dom_elem.data
        def getSrs(self):
            return self._srs
        def getRootFolder(self):
            return self._root_folder

    def __init__(self, info_xml_path):
        try:
            myfile = open(info_xml_path, 'r')
            #convert to string:
            mydata = myfile.read()
            #close file because we don't need it anymore:
            myfile.close()
            #parse the xml you got from the file
            self._dom = parseString(mydata)
            self._result_obj = self.ResultObj(self._dom)
            self._input_obj = self.InputObj(self._dom)
        except IOError:
            print("IO Error")
        except Exception:
            print("General Error!")

    def getDom(self):
        return self._dom
    def getGetResultParam(self):
        '''access <results> parameters'''
        return self._result_obj
    def getGetInputParam(self):
        '''access <input_data> parameters'''
        return self._input_obj
To use this class you need to create an xml file with the configuration parameters (e.g. c:\test\info.xml). Here is sample code showing how to access the parameters:
from my.xml.configurator import ConfigObj

conf = ConfigObj(r'c:\test\info.xml')
print conf.getGetInputParam().getSrs()
print conf.getGetInputParam().getRes()
print conf.getGetResultParam().getSrs()
print conf.getGetResultParam().getRootFolder()
As you can see, it is very easy to adjust the python code to a differently structured xml file. Maybe you need to add additional parameters inside the <input_data> section, or maybe you want to add an additional section for some other group of parameters (something like <global>). All this depends on the issue you want to solve using this approach. For this particular case I’ll stick with the parameters that already exist in the presented xml file ...I’m calling it info.xml but it can be any other name :-)

So... for now you know how to create a configuration file, define some parameters and values, and access those parameters using python code (class ConfigObj). The next step is to create a script that processes all files in all subfolders, starting from some arbitrary root folder. The script will search for configuration files (info.xml) and processing will be performed only for the folders that contain one. Here is the code for that. The processing is straightforward, but I will make some comments just to explain the idea. Some comments are also included in the source code.
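For example, if you wanted to add a <global> section, the parsing pattern stays exactly the same. A minimal sketch (the section and parameter names here are made up for illustration):

```python
from xml.dom.minidom import parseString

# hypothetical info.xml with an extra <global> section (names are illustrative)
sample = '''<?xml version="1.0" encoding="UTF-8"?>
<info>
  <global>
    <log_level>DEBUG</log_level>
  </global>
</info>'''

dom = parseString(sample)
node = dom.getElementsByTagName('global')[0]
elem = node.getElementsByTagName('log_level')[0].firstChild
# fall back to a default when the element is empty, like ConfigObj does
log_level = elem.data if elem else 'not_set'
print(log_level)
```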
'''
Created on 23. nov. 2012

@author: Stipica Pavicic
'''

from my.xml.configurator import ConfigObj
from my.process_timer import Timer

import os, fnmatch
import arcpy
import random
import shutil
from arcpy.conversion import PointToRaster


def createTmpGDB(folder):
    '''creates temporary file geodatabase'''
    #temp geodatabase name
    tmp_gdb_name = 'tempXXX.gdb'
    #redefine name if it already exists
    while os.path.exists(folder + os.sep + tmp_gdb_name):
        tmp_gdb_name = 'temp' + str(random.randint(1000000, 9000000)) + '.gdb'
    arcpy.CreateFileGDB_management(folder, tmp_gdb_name )
    return tmp_gdb_name

def deleteTmpGDB(tmp_gdb):
    '''removes temporary file geodatabase'''
    # a file geodatabase is an ordinary folder on disk
    shutil.rmtree(tmp_gdb)
    return 0

def proces_grid_file(in_file_folder, out_file_folder, resolution, in_srs, out_srs, filename, temporery_gdb):
    try:
        fbasename, extension = os.path.splitext(filename)
        in_csv = in_file_folder + os.sep + filename
        #point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + fbasename
        #basename can contain illegal characters for arcpy; that is why I use a random name (for this particular case the name of the intermediate result is not important)
        point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + str(random.randint(1000000, 9000000))
        while os.path.exists(point_data):
            #redefine point_data file name if it already exists
            point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + str(random.randint(1000000, 9000000))
        out_tif = out_file_folder + os.sep + fbasename + ".tif"

        arcpy.ASCII3DToFeatureClass_3d(in_csv, "XYZ", point_data, "POINT", "", out_srs, "", "", "DECIMAL_POINT")
        arcpy.PointToRaster_conversion(point_data, "Shape.Z", out_tif, "MOST_FREQUENT", "NONE", str(resolution))
        #process finished without exception
        return 1
    except Exception as e:
        print e
        #process finished with exception
        return 0

#name of the configuration file (one per folder to be processed)
config_file_name = 'info.xml'
#filter expression for the files to be processed
filter = '*.csv'

for (root, dirs, files) in os.walk(r'C:\sp\python\grid_data'):

    measure = Timer()
    # get parameters from xml file (conf) and define configuration for processing the files in folder (root)
    if config_file_name in files:
        print '*****************************************************************************'
        print '>>>> Root    : ', root
        print 'Nr. of files : ', len(fnmatch.filter(files,filter))  # number of files in folder that match filter expression
        print 'Config file  : ', root + os.sep + config_file_name
        conf = ConfigObj(root + os.sep + config_file_name)
        #open prj files
        inputSrsFile   = open(conf.getGetInputParam().getSrs(),'r')
        resultSrsFile  = open(conf.getGetResultParam().getSrs(),'r')
        #read files
        inputSrs  = inputSrsFile.read()
        resultSrs = resultSrsFile.read()
        #close files
        inputSrsFile.close()
        resultSrsFile.close()
        print 'spatial reference system definition'
        print 'input data srs :', inputSrs
        print 'results    srs :', resultSrs
        # create temporary file geodatabase
        tmp_gdb = createTmpGDB(root)
        # loop through list of files that match filter criteria (filter) & process files
        cnt = 0
        for filename in fnmatch.filter(files,filter):
            print '-----------------------------------------------------------------------------'
            cnt = cnt + 1
            print cnt, ' -> processing file: ', filename
            fbasename, extension = os.path.splitext(filename)
            #print root, '||', filename, '||', fbasename, '||', extension
            r = proces_grid_file(root, root, conf.getGetInputParam().getRes(), inputSrs, resultSrs, filename, tmp_gdb)
            if r:
                # r=1 process finished without exception
                print filename, ': processing finished!'
            else:
                # r=0 process finished with exception
                print filename, ': file is not processed!'
            print filename, measure.getTimeDifference()
        #delete temporary file geodatabase
        deleteTmpGDB(root + os.sep + tmp_gdb)
        # total time for processing
        print measure.getElapsedlTime()
The main processing starts at the os.walk loop, but before that I have included some helper functions:
  • createTmpGDB, to create a temporary file geodatabase
  • deleteTmpGDB, to remove the temporary file geodatabase
  • proces_grid_file, the file processing function
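Before wiring in the ArcGIS-specific parts, the overall control flow can be tried without ArcGIS at all. Here is a minimal, arcpy-free sketch of the walk-and-filter skeleton (the folder and file names are made up for illustration):

```python
import os
import fnmatch
import tempfile

# build a tiny sample folder tree (illustrative names)
root_dir = tempfile.mkdtemp()
sub = os.path.join(root_dir, '5m_grid')
os.makedirs(sub)
for name in ('info.xml', 'tile_001.csv', 'readme.txt'):
    open(os.path.join(sub, name), 'w').close()

config_file_name = 'info.xml'
name_filter = '*.csv'

matched = []
for (root, dirs, files) in os.walk(root_dir):
    # process only folders that contain a configuration file
    if config_file_name in files:
        for filename in fnmatch.filter(files, name_filter):
            matched.append(filename)

print(matched)  # ['tile_001.csv']
```

Note that readme.txt is skipped by the filter and the root folder is skipped because it has no info.xml.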

... how to use this code?

Firstly, you need to look at how the files are organised. If they are stored in folders grouped according to some shared property, there is a good chance this code can be useful to you, whether or not you need to perform the same processing as in this post. For this particular case the files were stored in folders according to resolution and spatial reference system, i.e. files with the same resolution and the same spatial reference system should be in the same folder. You can split files with the same properties across multiple folders, but you cannot put files with different properties in the same folder.

Secondly, for each folder that you want to process, create one xml file with the properties. Use the same name for all xml files (the config_file_name variable in the code; info.xml). Folders without a configuration file won’t be processed.

Thirdly, change the filter to suit your case (the filter variable). The files I had to process were delivered as csv files.
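The filter is a plain fnmatch pattern, so switching to another delivery format is a one-line change. For example (file names are illustrative):

```python
import fnmatch

files = ['tile_001.csv', 'tile_002.txt', 'tile_003.asc']
print(fnmatch.filter(files, '*.csv'))  # ['tile_001.csv']
print(fnmatch.filter(files, '*.asc'))  # ['tile_003.asc']
```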

And finally, run the script. To be able to run it you will have to set up the working environment (all imports must be accessible to the script). To measure processing time I have also used a class described in one of my previous posts. If you don’t want to use it, just comment out all lines related to the Timer class (the import, the Timer() call, and the print statements that use measure).
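If you do want the timing but don’t have the Timer class at hand, a minimal stand-in compatible with the calls used in the script might look like this (this is my assumption of its behaviour, not the original class from the earlier post):

```python
import time

class Timer(object):
    '''assumed re-implementation of the Timer helper used in the script'''
    def __init__(self):
        self._start = time.time()
        self._last = self._start
    def getTimeDifference(self):
        # seconds since the previous call (or since construction)
        now = time.time()
        diff = now - self._last
        self._last = now
        return diff
    def getElapsedlTime(self):
        # total seconds since construction (method name kept as used in the script)
        return time.time() - self._start

measure = Timer()
# ... do some work here ...
print(measure.getElapsedlTime())
```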

Feel free to use this code and to modify it according to your needs. I hope that I will make a new post to show you how to modify existing code and how to use it by following this concept: walking through the folders and processing files based on the xml configuration file.


... enjoy!
