A data step is a set of instructions (such as statements, functions and call routines) that describe how a data set is to be built. These instructions are processed by the data step compiler.
A data step starts with the data statement, optionally followed by one or more data set names.
The data step is used to perform data manipulation: an important purpose of the data step is to provide a means of reading external data and creating data sets for later use in procedures.
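The data step language looks a bit like PL/I.
Names to avoid for data sets, because they have a special meaning: _NULL_ (no data set is created), _DATA_ (a name of the form DATAn is chosen automatically), _LAST_ (the most recently created data set).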
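The difference between a named data set and the special name _null_ can be sketched as follows. The data set name tq84_named and the variable x are made up for this illustration.

```sas
/* A data step with an explicit data set name:
   one observation with the variable x is written to tq84_named */
data tq84_named;
  x = 1;
run;

/* _null_ is a special name: the statements are executed,
   but no data set is created */
data _null_;
  put 'This step writes to the log only';
run;
```

_null_ is typically used when a step is only needed for its side effects, such as writing to the log.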
Implied Loop of a Data Step
ILDS (Implied Loop of a Data Step): The data step performs an (implied) loop: for each observation, a set of statements is executed.
Additionally, the two variables _n_ (the number of the current iteration of the data step) and _error_ (1 if an error occurred while processing the current observation, otherwise 0) are generated automatically for every data step.
When the processing of an observation finishes, the values in the program data vector (PDV) are written to the output destination (except for the automatic variables and the variables excluded with drop).
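The implied loop can be made visible by writing the automatic variables to the log in every iteration. This is only a minimal sketch: the data set name tq84_loop and the variables val and tmp are assumptions chosen for the illustration.

```sas
data tq84_loop;
  input val;

  /* tmp is only needed inside the step and is not written
     to the output data set */
  tmp = val * 2;
  drop tmp;

  /* Executed once per input line: _n_ is incremented in each
     iteration of the implied loop, _error_ stays 0 unless an
     error occurs */
  put _n_= _error_= val= tmp=;

  datalines;
10
20
30
;
run;

proc print data = tq84_loop;
run;
```

The log should show one line per iteration, while the printed data set only contains val: neither the automatic variables nor the dropped tmp are written.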
Compilation and execution
A data step is processed in two phases: first the compilation phase and then the execution phase.
The compilation phase checks the syntactical correctness of the code and then creates the PDV and the descriptor portion of the output data set.
The input data is then read during the execution phase.
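One way to observe the effect of the two phases is to reference a variable before the statement that assigns it. The following sketch assumes a hypothetical variable x and uses _null_ because no output data set is needed.

```sas
data _null_;
  /* x already exists in the PDV here because the compiler saw
     the assignment below; its value is still missing */
  put 'before assignment: ' x=;

  x = 42;

  /* After the assignment has been executed, x has a value */
  put 'after assignment: ' x=;
run;
```

The first put statement does not fail: the variable was created during compilation, and only its value is set during execution.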
Renaming a variable
A variable (column name) can be renamed with the rename= data set option.
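A minimal sketch of renaming with rename= on the input data set. The data set names tq84_src and tq84_renamed and the new variable name amount are assumptions made for this illustration.

```sas
/* Small input data set with the variables txt and num */
data tq84_src;
  input txt $ num;
  datalines;
foo 17
bar 9
;
run;

/* rename= as a data set option on the input data set:
   inside this step, and in the output, num is called amount */
data tq84_renamed;
  set tq84_src (rename = (num = amount));
run;

proc print data = tq84_renamed;
run;
```

Because the option is applied when the data set is read, statements in the second step would have to refer to amount rather than num.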
/* Sample data: a three-character text followed by a number */
data tq84_in;
  input
    txt $3.
    num 8.;
datalines;
foo 17
bar 9
bar 22
foo 86
baz 55
foo 6
bar 84
baz 21
bar 64
;
run;

/* The by statement below requires the data to be sorted by txt */
proc sort data=tq84_in;
  by txt;
run;

data tq84_out;
  set tq84_in;
  by txt;
  /* Create a variable, named group_nr,
     that increases for each new group of txt */
  if first.txt then group_nr + 1;
run;

proc print data = tq84_out;
run;