[关闭]
@77qingliu 2018-05-05T11:15:18.000000Z 字数 7422 阅读 1719

A Hash Approach to Lookup Table in SAS(EN)

SAS


A follow up for my paper in PharmaSUG China 2017

SAS provides many ways to perform table lookups operation, such as using if-then statement in DATA step, or format statement, or merge statement, and even PROC SQL. Here, I'll introduce a new approach to look-up table which utilizes the HASH function. It's not only efficient but also simple and convenient.

What is table lookup?

Table lookup should be familiar to anyone with data analysis or database. The so-called table lookup usually refers to finding corresponding values in another data table through certain key combinations. For example, there is one table called A and there is only state ID in this dataset and there is another table called B which have both state ID and state capital, and I'd like to know the corresponding capital for each state in table A. How can we achieve this? The answer is table look up. The text description may be a little abstract, see the figure below.

Frequently used table lookup methods

  1. if-then statement
    The simplest way is the use the if-then statement directly.
  1. data states;
  2. if state = 'Virginia' then capital = 'Richmond';
  3. else if state = 'Georgia' then capital = 'Atlanta';
  4. ...

However, this way is not very flexible as the program must be modified once data is updated.
2. Merge statement
Another common operation is to use the merge statement.

  1. proc sort data = StatesCapital; by state;run;
  2. proc sort data = States; by state;run;
  3. data new;
  4. merge StatesCapital States;
  5. by state;
  6. run;

This statement is probably the most common method in SAS programmers daily life. A annoying part is that dataset should be sorted before merge operation.

  1. PROC SQL
    Advanced SAS programmers may use SQL
  1. proc sql;
  2. create table new as
  3. select a.state, b.capital
  4. from States as a, StatesCapital as b
  5. where a.state = b.state;
  6. quit;

Personally, I prefer this way. In this way, there is no need to sort the data in advance, and it is also intuitive.

These examples cited above are some classic table lookup table method in SAS, and it is also the most often used in SAS program.


Hash way to table lookup

People who like tossing may use the following method

  1. data new;
  2. if 0 then set work.StatesCapital;
  3. if _n_ = 1 then do;
  4. declare hash hh (dataset: "work.StatesCapital");
  5. hh.definedata("capital");
  6. hh.definekey("state");
  7. hh.definedone();
  8. end;
  9. set States;
  10. rc =hh.find();
  11. if rc = 0 then capital = capital;
  12. run;

if you do not know the hash method, you can seeA Hands-on Introduction to SAS® DATA Step Hash Programming Techniques

Hash in data step is very efficient. But it's a bit difficult to adult for the majority of SAS users. The first reason is that the program logic behind hash is not the same as general SAS programming, and the second is that more statements are required compared with merge or SQL and thus the program is harder to maintain.

As for the slightes improvement in efficiency, most people do not care.

So, how can we improve?
let's get familiar with the hash table algorithm before introducing the improved methods.

Hash Table algorithm

A HASH table is a data structure that enables you to accessed value quickly according to a key value.
That is, it accesses records by mapping key values to a location in the table to speed up the search. Through the hash table, the search efficiency can be improved to

The Hash table is similar to the index in the dictionary. Through the index we can quickly find the word we are looking for.
You can also compare the hash table to the summary of the article. Through the summary, we can quickly find the corresponding article.

Hash table in FCMP

in the above, we use Hash in DATA step to implement table lookup. it's a little clumsy, and I am afraid that most SAS users cannot accept such a wording. So, how can we improve it?

Starting from SAS9.2,Hash statement can be embedded in PROC FCMP to define a customed table lookup function. The specific implementation is as follows.

  1. proc fcmp outlib=work.functions.samples;
  2. * Define function name as well as the lookup key(the type of key needs to be specified at the same time). Notice that if the return value is character type, you need to specify the length and type of the return value(indicated by $);
  3. function hash_fcmp(state $) $25;
  4. * specify location of lookup table;
  5. declare hash hh(dataset: "work.StatesCapital");
  6. * specify data variable to be returned;
  7. rc=hh.definedata("capital");
  8. * specify the key
  9. rc=hh.definekey("state");
  10. rc=hh.definedone();
  11. * if find the corresponding value then return, otherwise return the specified value
  12. rc=hh.find();
  13. if rc eq 0 then return(capital);
  14. else return('Not Found');
  15. endsub;
  16. quit;
  17. * Don't forget to specify where SAS to call the function
  18. options cmplib=work.functions;
  1. * In Data Step;
  2. data new;
  3. set states;
  4. capital = hash_fcmp(state);
  5. run;
  6. * In Proc SQL;
  7. proc sql;
  8. create table new as select *,hash_fcmp(state) as capital
  9. from states;
  10. quit;

In this way, complex hash statements are nested within a SAS custom function. Users could simply invoke this function to perform table lookups.

The following table shows the basic syntax for calling HAHS in PROC FCMP

Method Syntax Description
DECLARE DECLARE hash object-name Create a new instance of hash object, create at parse time.
DEFINEKEY rc = object.DEFINEKEY('key 1','key n') Set up key variables for hash object
DEFINEDATA rc = object.DEFINEDATA('dataset 1','dataset n') Define data to be stored in hash object
DEFINEDONE rc = object.DEFINEDONE() Indicate the key and data specification is completed
DELETE rc = object.DELETE() Delete the hash object and free any resources allocated
FIND rc = object.FIND('key1','keyn') Search a hash object based on the values of defined key, If look up is successful, defined data variable are updated
CHECK rc = object.CHECK() Search a hash object based on the values of defined key, data will not be updated whether If look up is successful NUM_ITEMS rc = object. NUM_ITEMS()
ADD rc = object.ADD(key: value1, key: value n) Add data with associated key to hash object
REMOVE rc = object.REMOVE(key: value1, key: value n) Remove data with associated key to hash object

Why should we use Hash in FCMP?

The reason is simple, because it's simplicity and efficiency!

Practice

here's an straightforward and interesting application. By defining hash in FCMP, we can achieve some operations that are usually not easily implemented in SAS.

Assume that there are corresponding numeric value to each character.

char val
a 1
b 2
c 3
d 4
... ...
z 26

Given a string,you are required to sum the numeric value of this string。For example, the corresponding numeric value of 'abc' is:1 + 2 + 3 = 6

we can call this function directly to get the numeric value.

  1. data _null_;
  2. string = 'abc';
  3. num = 0;
  4. do i = 1 to lengthn(string);
  5. char = substr(string,i,1);
  6. num = num + GetNum(char);
  7. end;
  8. put num =;
  9. run;
  10. >> 6

and the customized function is beening defined like this

  1. proc fcmp outlib=width.functions.GetNum;
  2. function GetNum(char $);
  3. * Define Hash Table with Character Width Dataset;
  4. declare hash Calculate(dataset: "work.CharNum");
  5. rc = Calculate.defineKey("char");
  6. rc = Calculate.defineData("val");
  7. rc = Calculate.defineDone();
  8. * Retrieve Data from Hash and Sum up;
  9. rc = Calculate.find();
  10. if rc eq 0 then return(val);
  11. endsub;
  12. run;

you can even embed loops in the fuction to make it more simple.

  1. proc fcmp outlib=width.functions.GetNum;
  2. function GetNum(char $);
  3. * Define Hash Table with Character Width Dataset;
  4. declare hash Calculate(dataset: "work.CharNum");
  5. rc = Calculate.defineKey("char");
  6. rc = Calculate.defineData("val");
  7. rc = Calculate.defineDone();
  8. val_tot = 0;
  9. * Retrieve Data from Hash and Sum up;
  10. do i = 1 to lengthn(string);
  11. char = substr(string,i,1);
  12. rc = Calculate.find();
  13. if rc eq 0 then val_tot = val_tot+val;
  14. end;
  15. return(val_tot);
  16. endsub;
  17. run;

and call it directly in data step

  1. data _null_;
  2. string = 'abc';
  3. num = GetNum(string);
  4. put num =;
  5. run;
  6. >> 6

Because of the lack of some advanced data structures in SAS, it is a bit tricky to implement these functions. The following gives the implementation in Python, we can appreciate the power of Python.

  1. HashTable = {'a':1, 'b':2, ..., 'z':26}
  2. num = 0
  3. string = 'abc'
  4. for char in string:
  5. num += HashTable[char]
  6. print(num)
  7. >> 6

Summary

Table lookup by embedding Hash in PROC FCMP not only simply increase query efficiency, but also simplifies routine work and enhances program readability. In addition, this new style of writing also expands the functionality of the SAS to some extent.
Interested people can refer to the documentation to learn more.
Hashing in PROC FCMP to Enhance Your Productivity
Load a SAS data set into a Hash Object using PROC FCMP
PROC FCMP and DATA Step Component Objects

添加新批注
在作者公开此批注前,只有你和作者可见。
回复批注