April 26, 2018
There are several situations in which you will need to loop through data as a part of automating your stream. In this example, I will be developing regression models to predict sales using several inputs. I want to develop distinct models for each product category and sales region, which will require me to incorporate a data loop in my Jython script. This example also covers:
- Split Role disadvantages
- Using an Auto Numeric modeling node
- Defining stream parameters in a Jython Script
- Referring to stream parameters in the stream nodes
Figure 1 -This Type node shows the data fields and their roles. The roles set to “Split” here is for demonstration. I will change these to “None” later.
This example was developed using SPSS Modeler version 126.96.36.199. I am using a text file with obfuscated data from a real application. The data fields and roles are given in Figure 1. Notice that category and region roles are set to “Split.” Some modeling nodes are equipped to build models for split data, but not all. A full list of what nodes support splitting can be found in IBM Modeling Nodes documentation. Some process nodes are also rendered meaningless when working with split fields, like the Balance node. Furthermore, even when the nodes in your stream are fully capable of using splits, you may notice that performance drops dramatically when building models for each split.
I will use an Auto Numeric node, which builds models using all available algorithms and applies the most successful algorithms. The Auto Numeric node can use split fields, but in doing so it ignores the algorithms which cannot build models by splits, as demonstrated by this short experiment: After connecting the Auto Numeric node to the Type node, the Auto Numeric node’s Expert tab states that only 5 models are available. However, if I change the category and region roles in the Type node to be “None,” now 9 models are available. For this reason, I will keep the category and region roles set to “None.”
Figure 2 – Auto Numeric Expert tab: There are only 5 models available when a Split field is in the preceding Type node (even if the “Build models for each split” option is unchecked).
Figure 3 – Auto Numeric Expert tab: There are now 9 models available after changing Split roles to None in the preceding type node. Notice the lower section is no longer grayed.
The final stream is given in Figure 4. The first pipeline produces a list of distinct categories and regions which will be used to drive the modeling pipeline below. All parameters and node settings are given in Figure 5 through Figure 10.
NOTE ON USING TEXT FILES VS. DATABASE AS A SOURCE: The Vars source node, the Filter node, and the Distinct node within the first pipeline could be consolidated using a “SELECT DISTINCT ….” statement when sourcing from a database instead of a text file. Similarly, the Vars source node and the Select node would be consolidated using a “SELECT …… WHERE ….” statement when sourcing from a database, and the two pipelines would not be explicitly connected.
Figure 4 – Final pipeline used for this example.
Figure 5 – Stream parameters
Figure 6 – Keep Split Fields settings: I only need distinct values for my split fields, CATEGORY and REGION, so I filter the others.
Figure 7 – Distinct Split Fields settings: I want distinct combinations of my split fields, CATEGORY and REGION.
Figure 8 – Select settings: I am only using data for a single combination of split fields, as defined by the stream parameters CATEGORY and REGION. I would refer to the stream parameters the same way if I were writing a SQL statement: “SELECT … WHERE CATEGORY = ‘$P-CATEGORY’ AND REGION = ‘$P-REGION’.”
Figure 9 – Partition settings: I am using a standard 80-20 split. You typically (a.k.a. always) partition your data, and need a good reason to not partition data.
Figure 10 – Final Type node settings: Notice that CATEGORY and REGION are set to “None.”
How to Loop through Data in a Table Node
Figure 11 – Split_Fields node output
Results of running the output Table node is given in Figure 11. The jython script will loop through each row of this table, extract the category and region, redefine the stream parameters, and run the rest of the stream. The jython code that achieves this is given below. This is written in the “Execution” tab of the Stream Properties window.
Figure 12 – Execution tab of the stream properties window
import modeler.api, os, time, datetime
# Define stream variables
stream = modeler.script.stream()
modelingNode = stream.findByType(“autonumeric”,”SALES_QUANTITY”) # “autonumeric” refers to the Auto Numeric node. See IBM documentation for other nodes.
exportNode = stream.findByType(None, “Sales_Predictions”) # You can also reference nodes using only their name, as long as there is only 1 in the stream with that name.
#Get list of categories and regions
tableNode = stream.findByType(None, “Split_Fields”)
results = 
rowset = results.getRowSet()
# Loop through categories and regions and develop model for each, and score data for each
export_ran_once = 0
row = 0
row_count = rowset.getRowCount()
while row < row_count:
# Pull category and region from rowset and set parameters
category = rowset.getValueAt(row,0)
region = rowset.getValueAt(row,1)
# Run the model
modelResults = 
# Score the data and export results
if export_ran_once == 0: # Recreate the file/table on row 0
#exportNode.setPropertyValue(“write_mode”,”Create”) # When exporting to a database table, use Create… See IBM documentation for all nodes’ properties
exportNode.setPropertyValue(“write_mode”,”Overwrite”) # When exporting to a flat file, use Overwrite
export_ran_once = 1
else: # Append the subsequent results
# Advance to the next row
row = row + 1