SELECTION OF HASHING ALGORITHMS
Tim Boland Gary Fisher
JUNE 30, 2000
INTRODUCTION
The National Software Reference Library (NSRL) Reference Data Set (RDS) is built on file signature generation technology that is used primarily in cryptography. The selection of the specific file signature generation routines is based on customer requirements and the necessity to provide a level of confidence in the reference data that will allow it to be used in the U.S. Courts. This document gives an overview of the various hashing algorithms considered, as well as implementations of those algorithms. It also gives factors regarding their selection and use.
Hashing is an extremely good way to verify the integrity of a sequence of data bits (e.g., to make sure the contents of the sequence haven't been changed inadvertently). The sequence might make up a character string, a file, a directory, or a message representing data (binary 1s or 0s) stored in a computer system. The word "hash" means to "chop into small pieces" (RE1). A hashing algorithm is a mathematical function (or a series of functions) taking as input the aforementioned sequence of bits and generating as output a code (value) produced from the data bits and possibly including both code and data bits. Two files with exactly the same bit patterns should hash to the same code using the same hashing algorithm. If a hash for a file stays the same, there is only an extremely small probability that the file has been changed. On the other hand, if the hashes for the files do not match, then the files are not the same. Thus, hashes could be used as a primary verification tool to find identical files. The output code of the hash function should have a "random" property, so that different sequences of bits hash to different values as much as possible. Hashes are used in "scatter storage" systems, in digital signature applications, and recently in computer forensics applications, to determine whether the contents of a suspect machine have been modified maliciously.
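As a minimal sketch of the comparison described above, the following Python fragment hashes three byte sequences and compares the resulting codes. The use of SHA-1 from Python's standard hashlib module is only an illustrative assumption, since this section does not prescribe an algorithm at this point, and the function name and sample data are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a hex hash code for a sequence of bytes.

    SHA-1 is an illustrative assumption, not a choice made by the text.
    """
    return hashlib.sha1(data).hexdigest()


# Hypothetical sample data: two identical sequences and one altered copy.
original = b"contents of a reference file"
copy = b"contents of a reference file"
altered = b"contents of a reference filE"

# Identical bit patterns hash to the same code.
same = digest(original) == digest(copy)

# A single changed byte yields a different code.
changed = digest(original) != digest(altered)
```

Here `same` and `changed` are both true: matching codes indicate (with very high probability) identical content, while differing codes prove the content differs.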
Hashing algorithms can be efficiently implemented on modern computers.
Hashing algorithms fall within the realm of error detection techniques. In a general sense, the aim of an error detection technique is to enable the receiver of a message transmitted through a noisy (error-introducing) channel to determine whether the message has been corrupted. Some hashing algorithms perform complex transformations on the message to inject it with redundant information, while others leave the data intact and append a hash value on the end of a message. In any case, the transmitter may construct a hash value that is a function of the message. The receiver can then use the same hashing algorithm to compute the hash value of the received message and compare it with the transmitted hash value to see if the message was correctly received. If the hash values match, then the message was correctly received; if not, then there must have been an error in one or more of the data bits of the message.
The National Institute of Standards and Technology (NIST) of the U.S. Department of Commerce has been asked to investigate commonly-used hashing algorithms in support of the National Software Reference Library (NSRL). There are several algorithms available, differing in complexity, robustness, ease of use, and machine efficiency. Generally, "hardware (physical device)" methods of computing hashes involve extensive bit manipulations, and are relatively inefficient for a software (programmatic) computation, so the algorithms discussed contain software methodologies to streamline the hashing process. What is involved in all the algorithms is a method to break up the input into manageable portions, and manipulate the input in a systematic way over and over (iteratively). The algorithms generally differ in the degree to which they do this, and the number of iterations involved.
Given the above, NIST investigated available implementations of four different hashing algorithms and tested the algorithm output on some test data. The purpose of this exercise was to define some "reference" implementations for verifying the correctness of entries in the NSRL Reference Data Set (RDS). Multiple algorithms were considered because of the need for "double-checking" results and because many facilities use multiple hashing algorithms simultaneously. Performance metrics mentioned above were used in evaluating candidate implementations.
For each algorithm, the authorities (official sources and sanctions) of the source programs and tests used for testing accuracy will be described. Algorithms mentioned in this report may have limitations (clashes found, performance, etc.), which will be mentioned as appropriate. All implementations evaluated are freely available from the Internet.
Each of the four algorithms will be described below, starting with a high-level overview and progressing to more detail as appropriate.
CRC32
The cyclic redundancy code (CRC) algorithm is the simplest of the four hashing algorithm choices, but also the least robust. The name means that the algorithm operates in repetitive (cyclic) redundant cycles to produce an output hash code. The "32" indicates the number of bits being considered to produce the hash code (explained below). The CRC algorithm is a key component in the error-detecting capabilities of many communications protocols. In a CRC algorithm, the transmitter of a message constructs a value (called the checksum) and appends it to the message. The receiver can then use the same function to compute the checksum of the received message and compare it with the appended checksum to see if the message was correctly received. For example, if we chose a checksum function which was the sum of the decimal numbers in a message, it might go something as follows: Message: 1 2 3, Message with checksum: 1 2 3 6 (6 is the sum of 1 and 2 and 3), Message after transmission: 1 2 4 6. Here the third decimal number was corrupted from 3 to 4, and the receiver can detect this by computing the checksum (7=1+2+4) from the message, and comparing it with the transmitted checksum of 6. Obviously, both sender and receiver must be using the same algorithm to be consistent.
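The sum-based checksum example above can be sketched in a few lines of Python. The function names are hypothetical and exist only for illustration; the numbers are the ones used in the text.

```python
def append_checksum(message):
    """Transmitter side: append the sum of the message values as a checksum."""
    return message + [sum(message)]


def is_consistent(transmission):
    """Receiver side: recompute the sum and compare it with the appended checksum."""
    *message, checksum = transmission
    return sum(message) == checksum


sent = append_checksum([1, 2, 3])   # [1, 2, 3, 6]
received = [1, 2, 4, 6]             # third number corrupted from 3 to 4

sent_ok = is_consistent(sent)           # True: 1 + 2 + 3 == 6
received_ok = is_consistent(received)   # False: 1 + 2 + 4 == 7, not 6
```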
If the checksum itself is corrupted, a correctly transmitted message might be incorrectly identified as a corrupted one. However, this is a safe-side failure. A dangerous-side failure occurs where the message and/or checksum is corrupted in a manner that results in a transmission that is internally consistent. Unfortunately, this possibility is completely unavoidable and the best that can be done is to minimize its probability by increasing the amount of information in the checksum (RE2).
The above example is obviously very simple, and would not suffice for rigorous error detection. A more complex checksum function is needed. While addition is clearly not strong enough to form an effective checksum, it turns out that division is, so long as the divisor (number to divide by) is about as wide as the checksum register (place to store the checksum value). The basic idea behind CRC algorithms is simply to treat the message as an enormous binary number, to divide it by another fixed binary number, get a quotient, and make the remainder from this division the checksum. Upon receipt of the message, the receiver can perform the same division and compare the remainder with the "checksum" (transmitted remainder). For example, when dividing decimal 11 (message) by 4 (divisor) we get a value of 2 (quotient) with a remainder of 3.
With CRC division, instead of viewing the numbers mentioned above as positive integers, they are viewed as polynomials with binary coefficients. This is done by treating each number as a bit-string whose bits are the coefficients of a polynomial. For example, the ordinary number 23 (decimal) is 10111 (binary) and so it corresponds to the polynomial x**4 + x**2 + x**1 + x**0. Polynomials are used because they provide useful mathematical machinery in the calculations. CRC arithmetic is primarily about XORing (exclusive-ORing) particular values at various shifting offsets, which has the effect of doing the binary division.
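A minimal Python sketch of these two ideas follows: rendering a number as a polynomial with binary coefficients, and dividing by XORing the divisor at successive shift offsets. The function names and the sample dividend and divisor in the usage lines are hypothetical illustrations; only the number 23 comes from the text.

```python
def to_polynomial(n: int) -> str:
    """Render a non-negative integer as a polynomial with binary coefficients."""
    powers = [i for i in range(n.bit_length() - 1, -1, -1) if (n >> i) & 1]
    return " + ".join(f"x**{p}" for p in powers) or "0"


def mod2_remainder(dividend: int, divisor: int) -> int:
    """Remainder of binary polynomial division: XOR replaces subtraction."""
    if divisor == 0:
        raise ValueError("divisor must be non-zero")
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        # Align the divisor under the leading 1 bit of the dividend and XOR it in.
        shift = dividend.bit_length() - width
        dividend ^= divisor << shift
    return dividend


poly_23 = to_polynomial(23)   # "x**4 + x**2 + x**1 + x**0"

# Hypothetical example: a 14-bit dividend and a 5-bit divisor.
remainder = mod2_remainder(0b11010110110000, 0b10011)   # 0b1110
```

Because XOR involves no carries or borrows, the remainder generally differs from that of ordinary integer division: `mod2_remainder(5, 3)` is 0, whereas `5 % 3` is 2.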
An exclusive-OR function produces 1 if the two input bits are different; otherwise it produces 0. The CRC algorithm comes in different widths (12, 16, or 32 bits). We are considering the 32-bit (CRC32) algorithm here because it is the most robust. In this case the polynomial is 32 bits wide and the CRC32 checksum is also 32 bits. This also simplifies the calculation on most modern computers. Other CRC polynomials used besides CRC32 are CRC12, CRC16, and CRC-CCITT, from the Consultative Committee for Telephone and Telegraph (CCITT). On PCs one can deal with binary numbers of only 32 bits or fewer, so one must break up the enormous binary number mentioned above into manageable chunks. That's exactly what the two CRC algorithms mentioned below do. In order to speed up the process, the algorithms use a pre-calculated look-up table; the table contains a CRC for each character code between 0 and 255, so that the calculation doesn't need to be repeated as the text strings are processed. This process has the effect of performing the division of the enormous binary number by the generator polynomial, but in increments, due to the limitations of modern computing. In other words, instead of computing the CRC bit by bit, a 256-element lookup table can be used to perform the equivalent of 8 bit operations at a time.
To perform a CRC calculation, the user needs to choose a divisor. Generally the divisor is called the "generator polynomial" or simply the "polynomial", and is a key parameter of any CRC algorithm. One can choose any polynomial and come up with a CRC algorithm. However, some polynomials are better than others. An example of a polynomial used might be 79,764,919 decimal, or 0x04c11db7 hexadecimal. Theoretical mathematicians have calculated certain polynomials to provide the least duplications in remainders.
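The table-driven approach can be sketched in Python as follows. The polynomial 0x04C11DB7 is the one quoted in the text; the remaining conventions (processing bits in reflected order, starting from an all-ones register, and inverting the final value) are assumptions chosen to match the widely used CRC32 variant implemented by Python's zlib module, and are not taken from this section. All function and variable names are hypothetical.

```python
import zlib

POLY = 0x04C11DB7  # generator polynomial quoted in the text


def reflect32(value: int) -> int:
    """Reverse the bit order of a 32-bit value."""
    return int(f"{value:032b}"[::-1], 2)


# Assumption: the reflected (least-significant-bit-first) form, as used by zlib.
REFLECTED_POLY = reflect32(POLY)


def build_table() -> list:
    """Pre-calculate the CRC of every byte value from 0 to 255."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):  # eight single-bit division steps per table entry
            crc = (crc >> 1) ^ REFLECTED_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


TABLE = build_table()


def crc32_table_driven(data: bytes) -> int:
    """Compute a CRC32 one byte (8 bit operations) at a time via the table."""
    crc = 0xFFFFFFFF  # assumed initial register value
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF  # assumed final inversion


matches_zlib = crc32_table_driven(b"123456789") == zlib.crc32(b"123456789")
```

Under these assumptions the sketch agrees with `zlib.crc32`, which gives an independent check that the table lookup performs the same division as the bit-by-bit loop used to build the table.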
CRC Implementation
To implement a CRC algorithm is to implement CRC division. There are two reasons why the divide instruction of the host machine cannot be used. The first is that the division must be in CRC arithmetic. The second is that the dividend might be ten megabytes (1 byte=8 bits) long, and today's processors do not have registers large enough to hold a dividend of this size. To implement CRC division, we have to feed the message in smaller chunks through a division register.
Originally there were seven candidate CRC32 implementations (using C or C++ high-level programming languages) under consideration (which represents nearly all of the researched CRC32 implementations publicly available). Performance metrics used to evaluate these implementations were the following (in no particular order of importance): speed of execution, ease of use, accuracy, ability to operate on entire files, and choice of generator polynomial. One implementation was rejected because it did not produce accurate results, two were not set up to operate on entire files (only text strings), and two were slow (because they were not "table-driven"). Only two were reasonably fast, produced accurate information, were table-driven, and used generally accepted generator polynomials. Both are table-driven, but one uses a polynomial from an American National Standards Institute (ANSI) X3 Committee, while the other polynomial is not explicitly specified in code, but the table entries are the same as compared to the other. The two implementations are about the same number of programming statements. Slight preference was given for the algorithm that computes values for directories of files as well as individual files.
The test data used to verify the routines was from commonly used PKZIP (RE3) and WINZIP (RE4) products, and other various test character strings and file directories. Since these products are commonly used and routinely generate CRCs, they would be valid benchmarks of accuracy. The CRC outputs are in hex.
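The requirements described above (feeding a file through the CRC in chunks, operating on entire files and directories, and reporting results in hex) can be sketched in Python. This is not one of the evaluated C or C++ implementations; it relies on the standard zlib module's CRC32, and the function names and chunk size are hypothetical.

```python
import os
import zlib


def file_crc32(path: str, chunk_size: int = 65536) -> int:
    """Feed a file through the CRC in chunks instead of loading it whole."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # zlib.crc32 accepts the running value, so chunks can be chained.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def directory_crc32(directory: str) -> dict:
    """Map each regular file in a directory to its CRC32 as 8 hex digits."""
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            results[name] = f"{file_crc32(path):08X}"
    return results
```

Chaining the running value through successive chunks gives the same result as a single pass over the whole file, so the chunk size affects only memory use and speed.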
Both implementations verified correctly against the data. There are no apparent limitations in the implementations, other than the inherent CRC32 limitations, although one implementation produces more cursory output on only one file at a time. More information on each of these implementations is given below.
The first candidate CRC program (using the C language) computes the 32-bit CRC used as the frame check (error-detection) sequence in FIPS 71 (RE5). This source code is from the Snippets file collection (RE6). It consists of a header file, crc.h, and a main
