SAS and SQL cheat sheet


SAS


Basic data manipulation


Creating new variables in datasets

Suppose we want to create a copy, called SPXTSX_numbered, of an existing dataset FinData.SPXTSX (the name of the dataset shows that it is a permanent dataset stored in the folder/library FinData; if we used a non-permanent dataset, the name would be SPXTSX), and in the new dataset we want to have an additional variable row_num that simply counts sequentially the observations. The value of row_num will coincide with the counter that SAS automatically generates and puts on the left of every printed (with proc print) dataset (if the noobs option of proc print is used, the counter is not printed). The advantage of introducing a variable like row_num is that row_num is part of the dataset and can be access and used whenever necessary. For example, the counter _n_ that SAS creates (see the code snippet below) can be used in an if statement, but not in a where statement. The latter will give an error that the dataset (whichever dataset is active) does not contain a variable named _n_.

/* Because the datasets are relatively large, 
   use only the first 10 observations
   in all the datasteps that follow. */
options obs = 10;

data SPXTSX_numbered;

 set FinData.SPXTSX;
 row_num = _n_;

run;

/* Print the original dataset. */
proc print data = FinData.SPXTSX;
 title 'SPXTSX data';
run;

/* Print the numbered copy of the dataset,
   without the default counter. */
proc print data = SPXTSX_numbered noobs;
 title 'SPXTSX data';
run;

options obs = max;
spxtsx_data.png

The original dataset.

spxtsx_data_numbered.png

A copy of the original dataset with an additional variable row_num introduced. The new variable simply counts the number of observations row by row.

If we do not want to create a new dataset, but, instead, just add the new variable to an existing dataset (if the original dataset is large, creating a copy will cause a significant overhead), then we can use the name of the existing dataset in the set statement:

data FinData.SPXTSX;
    
    set FinData.SPXTSX;
    row_num = _n_;

run;

 

Stacking datasets on top of each other

This is simply done with a set statement:

/* Extract data in chunks of 5 */
data SPXTSX_5_a;

set FinData.SPXTSX;
 if _n_ < 6;
 
run;

data SPXTSX_5_b;

set FinData.SPXTSX;
 if _n_ > 5 and _n_ < 11;
 
run;

data SPXTSX_5_c;

set FinData.SPXTSX;
 if _n_ > 10 and _n_ < 16;
 
run;

/* Stack the three datasets on top of each other */
data SPXTSX_stacked;

set SPXTSX_5_a SPXTSX_5_b SPXTSX_5_c;
 
run;

Using successive chunks of 5 observations each (see the last code snippet), the result is

spxtsx_stacked.png


Macros


Macro definition

Define a macro named ‘DBGPRNT’ that takes one (mandatory) argument. If the macro variable ‘DEBUG’, which serves as a boolean flag, has been set equal to 1, then the value of ‘message’ is printed.

%MACRO DBGPRNT(message);

%IF &DEBUG = 1 %THEN
 %DO;
 print &message;
 %END;
 
%MEND DBGPRNT;

 

Include a macro in a SAS program

To invoke a macro, SAS has to be told where is the file that contains the macro definition.

%include "/folder1/.../folderN/Macros/DBGPRNT.sas";

 

Macro invocation

The macro ‘DBGPRNT’, defined above, is invoked as follows

%let x = hello;
%DBGPRNT(hello);

 

Emulate iteration over a list of strings

A program that reads a number of files whose names are stored in a macro variable.

* Show macro steps in the log;
options mprint;
* Path of macro location;
%let macro_location=/folders/myfolders/Macros/;
* Path of data location;
%let data_location=/folders/myfolders/Work/Data/;
* Include macro;
%include "&macro_location.import_csv.sas";

* Number of observations to read from each .csv file;
%let num_obs = 21;
* Names of files (without extension) and of datasets;
%let names=SP500 NASDAQ RUSSELL2000 Apple VIX;

* Import the data from the .csv files;
%import_csv(filenames_=&names,
            file_location_=&data_location.Financial/,
            num_obs_=&num_obs);

The macro that implements the logic.

* This macro imports a list of .csv files.;

/* filenames_ : list of filenames, without the extensions */
/* file_location_ : the path of the input files */
/* num_obs_ : number of observations to read */

%macro import_csv(filenames_, file_location_, num_obs_);
 
/* proc import does not have options for how many observations 
to read so we set the global option */
options obs=&num_obs_;
 
* For each name in the list of names;
%do k=1 %to %sysfunc(countw(&filenames_));
 
%let name=%scan(&filenames_, &k);
 
proc import file="&file_location_&name..csv" out=&name replace;
run;
 
%end;
 
* Set the global option back to its default value;
options obs=max;

%mend; *import_csv;

IML


Basic operations on IML arrays

proc iml;

A = {1, 2, 3, 4, 5}; /* Matrix literal: 5 rows, 1 column */
B = {1 2 3 4 5}; /* Matrix literal: 1 row, 5 columns */
C = 2*B;
D = B*A; /* Matrix product */

/* 
Element-wise product. The dimensions of the matrices have to be the 
same
*/
E = B#A`;

/* The squares of the integers from 1 to 10 */
F = 1:20;
FF = F#F;

/* Sum of the integers contained in F */
G = {[20] 1}`; /* A 20x1 vector of ones */
GG = F*G;
/* Built-in sum command */
built_in_sum = sum(F);

print A, B, C, D, E, 'squares', F, FF, 'sum as inner product', GG;
print 'built-in sum command', built_in_sum;

xy = F`||FF`; /* Horizontal concatenation */
call pgraf(xy, '.', 'x', 'y', 'y = x^2'); /* Primitive graph */

/* Sequences
 Create two sequences, transpose them to column vectors,
 and concatenate them. */
H = do(1, 5, 0.5)`||do(6, 10, 0.5)`;
print H[format=8.2];

/* Reduction */
/* Create a 5x5 matrix using the values in the matrix literal */
Z = shape({1 2 3 4 5}, 5, 5);
ZZ = Z[, <>]; /* Maximum in each column */
ZZZ = Z[<>, ]; /* Maximum in each row */
/*
Sum of squares of the elements of the reduced matrix
(repeated reduction)
*/
Y = Z[<>, ][##];
print 'shape({1 2 3 4 5}, 5, 5)', Z, 'Z[, <>]', ZZ, 'Z[<>, ]', ZZZ,
      'Z[<>, ][##]', Y;
X = sum({1, 2, 3, 4, 5});
print X;

quit;

 

Create IML arrays from datasets

PROC IML;
 
 /*
 In the opposite direction, create IML variables from the dataset
 that contains the point
 coordinates
 */
 use clusters;    /* clusters is the name of the dataset */
 read all var {X_A, Y_A, X_B, Y_B} into C; /* C is the IML array */
 close clusters;

QUIT;

 

Create datasets from IML arrays

%macro create_dataset_from_IML_vars(x =, y =, name =, add_obs =);

%IF &add_obs = 0 %THEN
 %DO;
 CREATE &name VAR {&x &y};
 append;
 close &name;
 %END;
%ELSE
 %DO;
 num_obs = NROW(&x);
 obs = do(1, num_obs, 1)`;
 
 CREATE &name VAR {obs &x &y};
 append;
 close &name;
 %END;

%mend create_dataset_from_IML_vars;

 

Create an IML module and store it permanently

PROC IML;

 START generate_random_2D_points(X_coord, Y_coord, num_points, mean,
                                 cov_mat);
 
 P = RANDNORMAL(num_points, mean, cov_mat);
 X_coord = P[, 1];
 Y_coord = P[, 2];
 
 /* IF data_set_flag = 1 THEN
 DO;
 CREATE dataset_from_IML_vars VAR {X_coord Y_coord};
 APPEND;
 CLOSE dataset_from_IML_vars;
 END; */
 
 return;
 
 FINISH generate_random_2D_points;
 
 libname Modules "/folders/myfolders/Modules";
 reset storage = Modules.kNN_2D;
 store module = generate_random_2D_points;

QUIT;

 

Load an IML module

PROC IML;
 
 /*
 Load module that generates normally distributed points in the plane.
 */

 libname Modules "/folders/myfolders/Modules";
 reset storage = Modules.kNN_2D;
 load module = generate_random_2D_points;
 
 /*
 Call the module twice to create two clusters. The coordinates of the
 points are stored in X_A, Y_A, X_B, Y_B
 */
 RUN generate_random_2D_points(X_A, Y_A, &num_points, &mean_A,
                              &sigma_A*I(ncol(&mean_A)));
 
 RUN;

SQL (in SAS)


Manually create an SQL table.

%let string_length = 30;

proc sql;

create table maths (id integer primary key,
 name char(&string_length),
 area char(&string_length)
 year integer);
 
 insert into maths values (1, 'Nash', 'Riemannian geometry', 1928);
 insert into maths values (2, 'Kolmogorov', 'probability theory', 1903);
 insert into maths values (3, 'Moser', 'Hamiltonian dynamics', 1928);
 insert into maths values (4, 'Pontryagin', 'algebraic topology', 1908);
 insert into maths values (5, 'Hironaka', 'singularity theory', 1931);
 
 select * from maths;
 
quit;

The output of the select * command is

Screen Shot 2016-11-18 at 22.15.59.png

Advertisements
This entry was posted in . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s