Tuesday, April 10, 2012

Overcome performance issues using FME command line and Java

keywords: command line, FME, large datasets, database, SQL, Java

In this post I’ve described procedure how to use FME command line functionality to overcome some performance issue with large datasets.

… just a few words about FME command line ….

FME workbench can be played using FME workbench too or simply by using command line. You can see exact command in the translation log window.


fme.exe my_workbench.fmw
         --SourceDataset_CSV C:\my_data.txt
         --DestDataset_XLS_ADO C:\my_export.xls
First parameter is command (starts FME engine), second parameter is workbench name, after that can be arbitrarily number of parameters as it is defined in workbench file (user published parameters) but some of them are published by default like source or destination dataset (check for FME documentation for more details).
In this sample command the name of source dataset parameter for csv file is SourceDataset_CSV and the name of destination dataset parameter for xls file is DestDataset_XLS_ADO.

Create your workbench file and play it. Examine your own translation log window and see what parameters are expressed in command line. That is going to be your starting point for optimisation.

I don’t want to say that using proposed procedure you can solve all performance issues but some of them you will surely overcome, especially if your task has similarities with my case.

… the case …

I have to calculate mean, median and mode height for all Croatian LPIS parcels (land parcel information system). The source data were:
  • height data: dgn files for all country (DTM: points and breaklines; accuracy: +/- 2m).  To get more feeling on the data size you can immagine height grid of 25m x 25m that covers area of 56.542square km.
  • LPIS polygons (number of polygons > 1 000 000)

In the workbench file I had to overlay all LPIS polygons and DTM and than I had to calculate mean, median, mode for every parcel as well as accuracy. The procedure to solve this task is simple but to do calculation for whole country it takes a lot of time. The conceptual workflow is presented on the next figure.

From my point of view if you are working with large datasets and large number of files it is always good idea to upload your data to the database. In fact, this post assumes that you have access to some database system.

I work always in that way, from experience I can say that database will save you time and that conversion from file to database system is worth effort. In my work I use mostly Oracle database (sometimes PostGIS). For the sake of convenience, if you have to refer to original files later in process, during conversion from file to database system you may consider to add format attributes (fme_basename, fme_dataset, fme_feature_type).

Conversion from some file system to database system can be described by following schema.



You may consider some other approach depending on the source data format and data model. Proposed workflow is sufficient for CAD files (sometimes you have to expose color, style, weight and store it as attribute, just like other format attributes in step 2).

Let me go back to the first schema. Reading DTM files from file or even from the database can be time consuming. In the case of Croatian DTM I had to wait many hours (I don’t remember exactly but I’m speaking of period longer than 10 hours). Reading LPIS polygons wasn’t issue (step 2). Try to imagine how long will take to overlap all LPIS polygons and DTM, a how much RAM you should have to perform such task. It is not hard to see that such task can not be easily solved. In some cases it can be impossible to find solution without process redesign.

… redesign …

The goal is to make calculation (mean, median, mode for height) on the parcel level (one polygon). Parcel defines area of interest and DTM data provide height information. The workflow for successful problem solution can be described as follows.


Posted Java application creates two database view that play rule of the source dataset in the workbench file that is later executed using command line approach in the same application. To be honest this application can be written in many programing languages but for this particular task I decide to use Java. You just have to create workbench file that connects to database views. Here is the java code... (to run this application you need to have oracle jdbc driver).


package my.util.fme;

import java.sql.*;
import java.io.*;

public class CreateBatch {
 
  public static void main(String[] args) 
      throws ClassNotFoundException, SQLException, IOException
  {
    
 String fmeDat, datasource, username, password, fileName;
 String q1, q2, q3;
 
 Runtime rt;
 Process pr;
 int exitVal;
 String command;
 
 Process prcs;
 InputStreamReader isr ;
 BufferedReader br ;
 String line; 
 
 try{
  if(args.length==5){
   
   fmeDat = args[0];
   datasource = args[1];
   username = args[2];
   password = args[3];
   fileName = args[4];
   
   FileWriter fstream; 
   BufferedWriter out;
   
   Class.forName("oracle.jdbc.driver.OracleDriver");
   
      String url = "jdbc:oracle:thin:@oracle1.geofoto.lan:1521:" + datasource;
             
      Connection conn = 
           DriverManager.getConnection(url,username, password);
   
      conn.setAutoCommit(false);
   
      Statement stmt = conn.createStatement();
   Statement stmt1 = conn.createStatement();
   
      //step 1 in proposed workflow
   ResultSet rset =  stmt.executeQuery("select lp.id from lpis_parcels lp order by id");
   
      ResultSet rset1, rset2, rset3;
   
      while (rset.next()) {
   
    fstream = new FileWriter(fileName);
    out = new BufferedWriter(fstream);
        
    //create view with one LPIS parcel
    //step 1 in proposed workflow
    q1="create or replace view v_lpis_parcels as " +
    "select  lp.* from lpis_parcels lp where lp.id=" + rset.getString(1);
    
    //create view with DTM points that have anyinteract topological relation with current parcel -> rset.getString(1)
    //step 2 in proposed workflow
    q2="create or replace view v_dtm_points as " +
    "select dtm.* from dtm_points dtm where sdo_anyinteract(dtm.geom, select lp.geom from v_lpis_parcels) = 'TRUE'";
    
    //create view with DTM lines that have anyinteract topological relation with current parcel -> rset.getString(1)
    //step 2 in proposed workflow
    q3="create or replace view v_dtm_lines as " +
    "select dtm.* from dtm_lines dtm where sdo_anyinteract(dtm.geom, select lp.geom from v_lpis_parcels) = 'TRUE'";
    
    rset1 =  stmt1.executeQuery(q1);
    rset2 =  stmt1.executeQuery(q2);
    rset3 =  stmt1.executeQuery(q3);
    
    //step 3 & 4: in proposed workflow these steps are performed inside fme workbench file
    command = "fme.exe " + 
         fmeDat + 
         " --SourceDataset_ORACLE8I " + 
         datasource +
         " --DestDataset_ORACLE8I " +
         datasource;  
    
    //create batch file: contains only one command; it is recreated in every iteration
    //probably it can be sloved without file creation but this works and I didn't find another way to execute fme.exe (command)
    //step 3 & 4: in proposed workflow these steps are performed inside fme workbench file
    out.write(command);
    out.write('\n');
    out.close();
    
    //executed batch file created in previous step
    //step 3 & 4: in proposed workflow these steps are performed inside fme workbench file
    rt = Runtime.getRuntime();     
    prcs = rt.exec(fileName);   
    isr =  new InputStreamReader( prcs.getInputStream() );         
    br = new BufferedReader(  isr ); 
       
    while  ((line = br.readLine()) != null)
      System.out.println(line); 
      }
   
      stmt.close();
   stmt1.close();

  }else{
   System.out.println ("");
   System.out.println ("java CreateBatch     ");  
   System.out.println ("");
   System.out.println ("   FME workbench file: *.fmw");  
   System.out.println ("   Oracle Service Name: Ask Database Administrator!");  
   System.out.println ("   Username: Ask Database Administrator!");  
   System.out.println ("   Password: Ask Database Administrator!");  
   System.out.println ("   File Name: Name of batch file created during process");  
  }
 }
 catch(SQLException e){
 
  System.out.println("------------------------------");
  System.out.println("SQLException");
  System.out.println("------------------------------");
  System.out.println(e);
 }
 catch(ClassNotFoundException e){
  System.out.println("------------------------------");
  System.out.println("ClassNotFoundException");
  System.out.println("------------------------------");
  System.out.println(e);  
 }
 catch(IOException e){
  System.out.println("------------------------------");
  System.out.println("IOException");
  System.out.println("------------------------------");
  System.out.println(e);  
 }
  }
}


In the nutshell:
  • prepare FME workbench that connects to the database view/s (keep in mind that view/s are created on the fly using application)
  • using starter application (java, python, c++, ….) create database view/s with minimum data needed for translation process using FME
  • invoke FME using  starter application (in the same run as previous step)
  • loop process until you solve task for whole dataset
  • store result of every iteration for later reporting and analysis

… to conclude …

Optimisation of described case is reached by breaking process into smaller units (working with one parcel at the time) and by the fact that the FME workbench doesn’t need to read whole dataset at once because it reads data from views that are created on the fly using posted java application. Java application glues two separate processes: data preparation (SQL) and execution of fme workbench (FME command line)  as it is presented on the workflow diagram.

enjoy!

No comments:

Post a Comment