key words: automation, processing, python, ArcGIS, arcpy, xml, minidom
The main idea for this post was to define way how to process datasets that are stored as a files and organized in multiple folders. To be more specific I will describe some particular case.
Task: Create GeoTiff files from delivered ASCII files (x,y,z). ASCII files are organised in folders: the root folder is “grid_data” and there are several subfolders grouping the data according to resolution (5m grid, 2m grid). There is no coordinate system definition attached to files. Coordinate system definition is delivered in separated pdf document as well as information about grid resolution.
Before processing you always have to make some preparation especially if you want to make you processing autonomous and not dependent to operator. In this case I’ve decided to create xml file (info.xml) that will hold some basic information about delivered data as well as some other information about desired parameters for processing. Here is sample file.
<?xml version="1.0" encoding="UTF-8"?>
<geoprocessing>
<input_data>
<!-- dataset resolution in m -->
<res>5</res>
<!-- name of ESRI prj file -->
<srs>C:\sp\python\grid_data\srs\ED_1950_UTM_Zone_31N.prj</srs>
</input_data>
<results>
<!-- Root folder for processed files. Leave empty if you want to use same folder as input data -->
<root_folder></root_folder>
<!-- name of ESRI prj file. Leave empty if you want to use same srs as input data -->
<srs></srs>
</results>
</geoprocessing>
I have decided to use xml for this purpose because it is more flexible when you need to make some changes (if you want to use similar approach to solution of some other task). To parse this xml file I have created following python class:
'''
Created on 23. nov. 2012
@author: Stipica Pavicic
'''
from xml.dom.minidom import parseString
'''
configuration sample file m:\python\info.xml
'''
class ConfigObj(object):
'''
Parsing of xml file that holds configuration parameters for porcessing.
Parameters are separated in two sections <input_data> and <results>.
'''
_dom=''
_result_obj=''
_input_obj=''
class InputObj(object):
'''
parsing of <input_data> segment of info.xml file
'''
_res = 'not_set'
_srs = 'not_set'
def __init__(self, dom_obj):
node = dom_obj.getElementsByTagName('input_data')[0]
dom_elem = node.getElementsByTagName('srs')[0].firstChild
if dom_elem:
self._srs = dom_elem.data
dom_elem = node.getElementsByTagName('res')[0].firstChild
if dom_elem:
self._res = dom_elem.data
def getSrs(self):
return self._srs
def getRes(self):
return self._res
class ResultObj(object):
'''
parsing of <results> segment of info.xml file
'''
_root_folder='not_set'
_srs='not_set'
def __init__(self, dom_obj):
node = dom_obj.getElementsByTagName('results')[0]
dom_elem = node.getElementsByTagName('srs')[0].firstChild
if dom_elem:
self._srs = dom_elem.data
else:
# in the case if result srs is not defined in xml file use the input data srs
nodei = dom_obj.getElementsByTagName('input_data')[0]
dom_elemi = nodei.getElementsByTagName('srs')[0].firstChild
self._srs = dom_elemi.data
dom_elem = node.getElementsByTagName('root_folder')[0].firstChild
if dom_elem:
self._root_folder = dom_elem.data
def getSrs(self):
return self._srs
def getRootFolder(self):
return self._root_folder
def __init__(self, info_xml_path):
try:
myfile = open(info_xml_path,'r')
#convert to string:
mydata = myfile.read()
#close file because we dont need it anymore:
myfile.close()
#parse the xml you got from the file
self._dom = parseString(mydata)
self._result_obj = self.ResultObj(self._dom)
self._input_obj = self.InputObj(self._dom)
except IOError:
print("IO Error")
except:
print("General Error!")
def getDom(self):
return self._dom
def getGetResultParam(self):
'''
access parameters
'''
return self._result_obj
def getGetInputParam(self):
'''
access parameters
'''
return self._input_obj
To use this class you need to create xml file with configuration parameters (eg. c:\test\info.xml). Here is the sample code how to access parameters:
from my.xml.configurator import ConfigObj
conf = ConfigObj(r'c:\test\info.xml')
print conf.getGetInputParam().getSrs()
print conf.getGetInputParam().getRes()
print conf.getGetResultParam().getSrs()
print conf.getGetResultParam().getRootFolder()
As you can see it is very easy to adjust python code to some different structured xml file. Maybe you need to add additional parameters inside <input> section, or maybe you want to add additional section for some other group of parameters (something like <global>). All this depends on the issue you want to solve using this approach. For this particular case I’ll stick with the parameters that already exists in presented xml file ..I’m calling it info.xml but it can be any other name :-)
So... for now you know how to create configuration file, to create some parameters and values, and how to access those parameters using python code (class ConfigObj).
Next step is to create script that process all files in all subfolders starting from some arbitrary root folder. Script will search for configuration files (info.xml) and processing will be performed only for those folders. Here is code for that. The processing is straightforward but I will make some comments just to explain idea. Some comments are also included in source code.
'''
Created on 23. nov. 2012
@author: Stipica Pavicic
'''
from my.xml.configurator import ConfigObj
from my.process_timer import Timer
import os, fnmatch
import arcpy
import random
import shutil
from arcpy.conversion import PointToRaster
arcpy.CheckOutExtension("3D")
def createTmpGDB(folder):
'''creates temporary file geodatabase'''
#temp geodatabase name
tmp_gdb_name = 'tempXXX.gdb'
#redefine name if already exists
while os.path.exists(folder + os.sep + tmp_gdb_name):
tmp_gdb_name = tmp_gdb_name.replace('XXX',str(random.randint(1000000, 9000000)))
arcpy.CreateFileGDB_management(folder, tmp_gdb_name )
return tmp_gdb_name
def deleteTmpGDB(tmp_gdb):
'''removes temporary file geodatabase'''
shutil.rmtree(tmp_gdb)
return 0
def proces_grid_file(in_file_folder, out_file_folder, resolution, in_srs, out_srs, filename, temporery_gdb):
try:
fbasename, extension = os.path.splitext(filename)
in_csv = in_file_folder + os.sep + filename
#point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + fbasename
#basename can contain ilegal characters for arcpy that is why I use random name (for this particular case the name of intermediate results is not important
point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + str(random.randint(1000000, 9000000))
while os.path.exists(point_data):
#redefine point_data file name if already exists
point_data = out_file_folder + os.sep + temporery_gdb + os.sep + "tp" + str(random.randint(1000000, 9000000))
out_tif = out_file_folder + os.sep + fbasename + ".tif"
arcpy.ASCII3DToFeatureClass_3d(in_csv, "XYZ", point_data, "POINT", "", out_srs, "", "", "DECIMAL_POINT")
arcpy.PointToRaster_conversion(point_data , "Shape.Z", out_tif, "MOST_FREQUENT", "NONE", str(resolution))
#process finished without exception
return 1
except Exception as e:
print e
#process finished with exception
return 0
for (root, dirs, files) in os.walk(r'C:\sp\python\grid_data'):
measure = Timer()
config_file_name='info.xml'
filter='*.csv'
# get parameters from xml file (conf) and define configuration for processing the files in folder (root)
if config_file_name in files:
print '*****************************************************************************'
print '>>>> Root : ', root
print 'Nr. of files : ' , len(fnmatch.filter(files,filter)) # nuber of files in folder that match filter expression
print 'Config file : ' , root + os.sep + config_file_name
conf = ConfigObj(root + os.sep + config_file_name)
#open prj files
inputSrsFile = open(conf.getGetInputParam().getSrs(),'r')
resultSrsFile = open(conf.getGetResultParam().getSrs(),'r')
#read files
inputSrs = inputSrsFile.read()
resultSrs = resultSrsFile.read()
#close files
inputSrsFile.close()
resultSrsFile.close()
print 'spatial reference system definition'
print 'input data srs :' , inputSrs
print 'results srs :' , resultSrs
# create temporary file geodatabase
tmp_gdb = createTmpGDB(root)
# loop though list of files that match filter criteria (filter) & process files
cnt = 0
for filename in fnmatch.filter(files,filter):
print '-----------------------------------------------------------------------------'
cnt = cnt + 1
print cnt, ' -> processing file: ', filename
fbasename, extension = os.path.splitext(filename)
#print root, '||', filename, '||', fbasename, '||', extension
r=proces_grid_file(root, root, conf.getGetInputParam().getRes(), inputSrs, resultSrs, filename, tmp_gdb)
if r:
# r=0 process is finished without exception
print filename, 'processing finished!'
else:
# r=1 process is finished with exception
print filename, ': file is not processed!'
print filename, measure.getTimeDifference()
#delete temporary file geodatabase
deleteTmpGDB(root + os.sep + tmp_gdb)
# total time for processing
print measure.getElapsedlTime()
Script starts at the line 59 but before that I have included some helper functions:
- to create temporary file geodatabase (line 17)
- to remove temporary file geodatabase (line 30)
- file processing function (line 36)
... how to use this code?
Firstly you need to look how the files are organised. If they are stored in folder by grouping according to some joint property there is a good chance that this code can be useful to you, no matter if you need to perform the same processing (as in this post) or not. For this particular case files were stored in folders according to the resolution and the spatial reference system, ie. files with the same resolution and the same spatial reference system should be in the same folder. You can split files with the same properties in multiple folders but you cannot put the files with the different properties in the same folder.
Secondly, for all folders that you want to process, create one xml file with the properties. Use the same name for all xml files (see the code line 62; info.xml). Folders without configuration file won’t be processed.
Thirdly, change the filter you want to use for your case (line 63). Files I had to process were delivered as csv files.
And finally, run the script. To be able to run it you will have to establish working enviroment (all import should be accessible by script). To measure processing time I have also used class described in one of my previous posts. If you don’t want to use it just comment lines 7, 61, 110 and 116 (all lines related to Timer class).
Feel free to use this
code and to modify it according to your needs. I hope that I will make a new post to show you how to modify existing code and how to use it by following this concept: walking through the folders and processing files based on the xml configuration file.
.....
... enjoy!