Data hashing algorithms. Hashing concept

May 12, 2010 at 01:28 AM

Hash Algorithms

  • Information Security

As many are probably aware, since 2007 the US National Institute of Standards and Technology (NIST) has been holding a competition to develop a hash algorithm to replace SHA-1 and, eventually, the SHA-2 family. Yet this topic has somehow been deprived of attention on the site, and that is what brings me to you. I offer you a series of articles on hash algorithms. In this series we will study the basics of hash functions together, look at the best-known hash algorithms, plunge into the atmosphere of the SHA-3 competition, examine the candidate algorithms, and certainly test them. If possible, the Russian hashing standards will also be covered.

About myself

Student of the Department of Information Security.

About hashing

Currently, virtually no application of cryptography is complete without hashing.
Hash functions are functions designed to "compress" an arbitrary message or data set, usually written in a binary alphabet, into a bit pattern of fixed length called a digest. Hash functions have a variety of applications: in statistical experiments, in testing logic devices, in building fast search algorithms, and in checking the integrity of records in databases. The main requirement for hash functions is a uniform distribution of their values when the argument is chosen at random.
A cryptographic hash function is a hash function that is cryptographically strong, that is, one satisfying a number of requirements specific to cryptographic applications. In cryptography, hash functions are used to solve the following problems:
- building systems for data integrity control during transfer or storage,
- data source authentication.

A hash function is any easily computable function h: X -> Y such that for any message M the value h(M) = H (the digest) has a fixed bit length. Here X is the set of all messages, and Y is the set of binary vectors of fixed length.

As a rule, hash functions are built on the basis of so-called one-step compression functions y = f(x1, x2) of two variables, where x1, x2 and y are binary vectors of length m, n and n respectively; n is the digest length and m is the message block length.
To compute h(M), the message is first split into blocks of length m (if the message length is not a multiple of m, the last block is padded in some special way), and then the following sequential digest computation is applied to the resulting blocks M1, M2, ..., MN:

H_0 = v,
H_i = f(M_i, H_(i-1)),  i = 1, ..., N,
h(M) = H_N.

Here v is a constant, often called the initialization vector. It is chosen
for various reasons and can be a secret constant or a set of random data (a sample of the date and time, for example).
With this approach, the properties of the hash function are completely determined by the properties of the one-step compression function.

There are two important types of cryptographic hash functions: keyed and keyless. Keyed hash functions are called message authentication codes. They make it possible, without additional means, to guarantee both the correctness of the data source and the integrity of the data in systems with mutually trusting users.
Keyless hash functions are called error detection codes. Combined with additional means (encryption, for example), they make it possible to guarantee data integrity. These hash functions can be used in systems with both trusting and non-trusting users.

About statistical properties and requirements

As I already said, the main requirement for hash functions is a uniform distribution of their values when the argument is chosen at random. For cryptographic hash functions it is also important that the slightest change in the argument changes the value of the function greatly. This is called the avalanche effect.

Keyed hash functions must meet the following requirements:
- infeasibility of fabrication,
- infeasibility of modification.

The first requirement means that it is very hard to find a message with a correct digest value. The second means that, given a message with a known digest value, it is very hard to find another message that also has a correct digest value.

The following requirements are imposed on keyless functions:
- unidirectionality,
- resistance to collisions,
- resistance to finding the second preimage.

Unidirectionality (one-wayness) means the high complexity of finding a message from a given digest value. It should be noted that no hash function currently in use has proven one-wayness.
Collision resistance means the difficulty of finding a pair of messages with the same digest value. Usually it is cryptanalysts' discovery of a method for constructing collisions that serves as the first signal that an algorithm is outdated and must soon be replaced.
Resistance to finding a second preimage means the difficulty of finding, for a given message with a known digest value, a second message with the same digest value.

That was the theory; it will be useful to us later on ...

About popular hash algorithms

The CRC16/32 algorithms are checksums (not cryptographic transformations).

The MD2/4/5/6 algorithms are the creation of Ron Rivest, one of the authors of the RSA algorithm.
The MD5 algorithm was once very popular, but the first signs of its breaking appeared in the late nineties, and its popularity is now rapidly declining.
MD6 is a very interesting algorithm from a design standpoint. It was entered in the SHA-3 competition, but, unfortunately, the authors did not have time to bring it up to the requirements, and it is absent from the list of candidates that passed to the second round.

The SHA family of algorithms is widely used now. An active transition is under way from the SHA-1 standard to the SHA-2 standards. SHA-2 is the collective name for the SHA224, SHA256, SHA384 and SHA512 algorithms. SHA224 and SHA384 are essentially analogs of SHA256 and SHA512 respectively: after the digest is computed, some of its information is discarded. They should only be used for compatibility with older equipment.

The Russian standard is GOST R 34.11-94.

In the next article

Overview of MD algorithms (MD4, MD5, MD6).

Literature

A.P. Alferov, Fundamentals of Cryptography.

Bruce Schneier, Applied Cryptography.

To solve the problem of finding a needed element among large amounts of data, the hashing algorithm was proposed (hashing — shuffling): keys identifying the array's data are created and, based on them, the data is written into a table called a hash table. The key of a record is mapped by a function i = h(key), called a hash function. The hashing algorithm determines the position of the desired element in the hash table from the value of its key obtained via the hash function.

The concept of hashing is the splitting of a common (base) set of unique data-item keys into disjoint subsets with a certain property.

Take a dictionary or encyclopedia, for example. In that case the letters of the alphabet can be taken as search keys; that is, the main element of a hashing algorithm is the key. In most applications the key provides an indirect reference to the data.

In fact, hashing is a special method of addressing data by keys for quickly finding needed information.

If the base set contains N elements, then it can be divided into 2^N different subsets.

Hash table and hash functions

A function that maps item keys to a set of integers (indices in a table, the hash table) is called a hashing function, or hash function:

i = h(key);

where key is the key to be converted and i is the resulting table index; that is, the key is mapped to a set of integers (hash addresses), which are then used to access the data.

However, a hash function can produce the same position i in the table for several key values. A situation in which two or more keys receive the same index (hash address) is called a hashing collision.

A good hash function minimizes collisions and distributes data evenly across the entire table; a perfect hash function generates no collisions at all.

There are two ways to resolve hashing collisions:

- the open addressing method with linear probing;

- the chaining method.

Hash table

A hash table is an ordinary array with unusual addressing given by a hash function.

A hash structure can be considered a generalization of an array, providing fast direct access to data by index.

There are many hashing schemes, differing in the choice of a good function h(key) and of the collision resolution algorithm. The effectiveness of solving a real practical problem depends significantly on the chosen strategy.

Examples of hash functions

The chosen hash function should be easy to compute and create as few collisions as possible, i.e., it should distribute keys evenly across the available table indices. Of course, it is impossible to determine whether a particular hash function will distribute keys correctly if the keys are not known in advance. However, although the keys themselves are rarely known before the hash function is chosen, some of their properties that affect their distribution usually are. Let's consider the most common ways of defining a hash function.

Division method. The initial data are an integer key, key, and the table size m. The result of the function is the remainder of dividing the key by the table size. General form of the function:

int h(int key, int m) {
    return key % m;   /* values 0 .. m-1 */
}

For m = 10, the hash function returns the least significant digit of the key.

For m = 100, the hash function returns the two least significant digits of the key.

Additive method, where the key is a character string. The hash function converts the string to an integer by summing the codes of all its characters and taking the remainder of division by m (usually the table size m = 256):

int h(char *key, int m) {
    int s = 0;
    while (*key)
        s += *key++;   /* sum the character codes */
    return s % m;
}

Collisions occur on strings containing the same set of characters, for example abc and cab.

This method can be slightly modified, obtaining the result by summing only the first and last characters of the key string:

int h(char *key, int m) {
    int len = strlen(key), s = 0;
    if (len < 2)               /* if the key length is 0 or 1, */
        s = key[0];            /* take the key itself */
    else
        s = key[0] + key[len - 1];
    return s % m;
}

In this case, collisions will occur only on strings with the same first and last characters, for example abc and amc.

Mid-square method, in which the key is squared (multiplied by itself) and several middle digits of the resulting value are used as the index.

For example, if the key is a 32-bit integer, the hash function can return 10 middle bits of its square:

int h(unsigned int key) {
    key *= key;          /* square the key (modulo 2^32) */
    key >>= 11;          /* discard the 11 least significant bits */
    return key % 1024;   /* return the next 10 bits */
}

Exclusive-OR method for string keys (usually the table size is m = 256). This method is similar to the additive method but distinguishes between similar words. It consists in sequentially applying the "exclusive OR" operation to the elements of the string.

In the multiplicative method, a real constant r from the interval (0, 1), chosen once in advance, is additionally used. The fractional part of the product key * r is multiplied by the table size m, and the integer part of the result gives a value in the range from 0 to m - 1:

int h(int key, int m) {
    const double r = 0.6180339887;   /* a constant from (0, 1); the golden-ratio
                                        conjugate is a common choice */
    double x = key * r;
    x = x - (int) x;                 /* take the fractional part */
    return (int)(m * x);
}

In general, for large m the indices generated by the hash function are spread over a wide range. Moreover, mathematical theory says the distribution is more uniform if m is a prime number.

In the examples considered, the hash function i = h(key) only determines the position from which to search for (or initially, to place in the table) a record with the key key. Therefore the hashing scheme must also include a collision resolution algorithm, which determines what to do when the position i = h(key) turns out to be occupied by a record with a different key.

Hash and "hash function"



"Hash" is the English word hash, which in Russian is most often used in the compounds "hash function", "hash sum" or "hash algorithm". Let's try to figure out what it is and what it is for.

Hashing means the deterministic (unambiguous and well-defined) computation of a character string of fixed length from input of arbitrary length. Changing even one character in the original data guarantees (with probability close to 100%) that the resulting fixed-length string will be different. We can say that hashing takes a "fingerprint" of a large data set.

What is all this for? Let's look at an example: you have downloaded a large file (say, a zip archive) and want to make sure there are no errors in it. You can compute the "hash sum" (that same fingerprint) of the file and check it against the one published on the site. If the hash-sum strings differ, the file is definitely corrupted.

Another example: to protect user data, a bank should not store passwords in the clear in its database. Instead, the bank stores the hash sums of the passwords; each time a password is entered, it computes its hash sum and verifies it against the one stored in the database. And here a reasonable question arises about possible "collisions", that is, identical hashing results for different passwords. A good hash function should keep collisions to an absolute minimum, and for that it must be made quite complex and convoluted.



Hashing when solving problems in C++

Searching for data in large volumes of information is time-consuming, because a significant number of items must be examined and compared with the search key. The search can be shortened by localizing the area to examine: for example, by sorting the data on the search key, splitting it into non-overlapping blocks by some group criterion, or associating the real data with a code that simplifies the search procedure.

Currently, hashing is a widespread method of providing fast access to information stored in external memory.

Hashing (eng. hashing) is the conversion of an input data array of a certain type and arbitrary length into an output bit string of fixed length. Such transformations are also called hash functions or convolution functions, and their results are called a hash, hash code, hash sum, or message digest (eng. message digest).

A hash table is a data structure that implements the associative array interface: it stores key-value pairs and supports three operations: adding a new pair, looking up a pair by key, and deleting a pair by key. A hash table is an array filled in an order determined by a hash function. The hash function should satisfy the following requirements:

  • the function must be computationally simple;
  • the function should distribute the keys in the hash table as evenly as possible;
  • the function should not map any relationship between key values to a relationship between address values;
  • the function should minimize the number of collisions - that is, situations when different keys correspond to one hash value (in this case, the keys are called synonyms).

Here the first property of a good hash function depends on the characteristics of the computer, and the second on the data values.

If all data were random, hash functions could be very simple (for example, a few bits of the key). However, in practice random data are quite rare, and one has to create a function that depends on the entire key. If a hash function distributes the set of possible keys uniformly over the set of indices, then hashing effectively splits the set of keys. The worst case is when all keys hash to one index.

In case of collisions, a new place must be found to store the keys competing for the same cell of the hash table. Moreover, if collisions are possible, their number should be minimized. In some special cases collisions can be avoided altogether: if all the element keys are known in advance (or change very rarely), an injective hash function can be found for them that distributes them among the cells of the hash table without collisions. Hash tables using such hash functions need no collision resolution mechanism and are called direct-addressing hash tables.

Hash tables have the following properties.

  • The operation in the hash table begins with the calculation of the hash function from the key. The resulting hash value is the index in the original array.
  • The number of stored array elements divided by the number of possible hash-function values is called the fill factor (load factor) of the hash table and is an important parameter on which the average operation time depends.
  • Search, insert, and delete operations should take O(1) time on average. However, this estimate ignores the possible cost of rebuilding the hash table index when the array grows and a new pair is added.
  • Collision resolution is an essential part of any hash table.

Hashing is useful when a wide range of possible values must be stored in a small amount of memory and fast, near-random access is needed. Hash tables are often used in databases and especially in language processors such as compilers and assemblers, where they speed up processing of the identifier table. Everyday uses of hashing include distributing books in a library by thematic catalogs, ordering dictionaries by the first letters of words, coding degree programs in universities, and so on.

Collision resolution methods

Collisions complicate the use of hash tables, as they violate the one-to-one correspondence between hash codes and data. However, there are ways to overcome the difficulties that arise:

  • chaining method (external or open hashing);
  • open addressing method (closed hashing).

Chaining method. Elements of the set that hash to the same value are chained together in a list. Position i contains a pointer to the head of the list of those elements whose key hashes to i; if there are no such elements in the set, NULL is stored in position i. Fig. 38.1 illustrates collision resolution by chaining: key 002 is claimed by two values, which are organized into a linear list.


Figure 38.1.

Each cell in the array is a pointer to a linked list (chain) of key-value pairs corresponding to the same key hash value. Collisions simply result in chains of more than one element in length.

Search and delete operations require traversing the elements of the corresponding chain to find the element with the given key. To add data, an element is appended to the end or the beginning of the corresponding list; if the fill factor becomes too large, the array is enlarged and the table rebuilt.

Assuming that each element is equally likely to hash to any position in the table, independently of where the other elements went, the average length of a chain is n/m — the table's load factor.

A hash function is a function that turns input data of any (usually large) size into data of a fixed size. Hashing (eng. hashing) is the conversion of an input data array of arbitrary length into an output bit string of fixed length. Such transformations are also called hash functions or convolution functions, and their results are called a hash, hash code, hash sum, or message digest (eng. message digest).

The hash function is used, in particular, in data structures such as hash tables and is widely applied in software for fast data retrieval. Hash functions are used to optimize tables and databases by letting identical records be detected through identical hash values. This approach to finding duplicates is effective in large files; one example is finding similar stretches in DNA sequences. A cryptographic hash function makes it easy to check that some input data matches a given hash value; but if the input is unknown, it is deliberately difficult to reconstruct it (or an equivalent alternative) from the stored hash value alone. This is used to ensure the integrity of transmitted data and is the building block of HMACs, which provide message authentication.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error-correcting codes, and ciphers. Although these concepts overlap to some extent, each has its own scope and requirements and is designed and optimized differently.

History

Donald Knuth credits the first systematic idea of hashing to IBM employee Hans Peter Luhn, who proposed it in January 1953.

In 1956, Arnold Dumey, in his Computers and Automation, was the first to present the concept of hashing as most programmers know it today. Dumey treated hashing as a solution to the "dictionary problem" and also suggested using the remainder of division by a prime number as the hash address.

The first significant work on searching in large files was Wesley Peterson's 1957 article in the IBM Journal of Research and Development, in which he defined open addressing and also pointed out the performance degradation caused by deletions. Six years later Werner Buchholz's work was published, which explored hash functions in depth. Over the next several years hashing was widely used, but no significant works were published.

In 1967, hashing in the modern sense was mentioned in Herbert Hellerman's book Principles of Digital Computing. In 1968, Robert Morris published a large survey of hashing in Communications of the ACM. This work is considered the publication that introduced the concept of hashing into scientific circulation and finally fixed the term "hash" among specialists.

Until the early 1990s, thanks to the works of Andrei Ershov, the Russian-language equivalent of the term "hashing" was the word "constellation", and the term "conflict" was used for collisions (Ershov used "constellation" from 1956 on, and the Russian-language edition of Niklaus Wirth's book Algorithms and Data Structures (1989) also uses that term). However, none of these variants caught on, and the literature mainly uses the term "hashing".

Description

Hashing is used to build associative arrays, to find duplicates in collections of data sets, to build unique identifiers for data sets, to compute checksums that detect accidental or deliberate errors in storage or transmission, to store passwords in security systems (in this case, access to the memory area where the passwords are kept does not allow recovering the password itself), and to generate electronic signatures (in practice it is often not the message itself that is signed but its hash image).

In the general case there is no one-to-one correspondence between the original data and the hash code, because the number of hash-function values is smaller than the number of possible input arrays. There are many arrays with different contents that yield the same hash code — so-called collisions. The probability of collisions plays an important role in assessing the quality of hash functions.

There are many hashing algorithms with different properties (bit depth, computational complexity, cryptographic strength, etc.). The choice of one or another hash function is determined by the specifics of the problem being solved. The simplest examples of hash functions are the checksum or CRC.

Types of hash functions

A good hash function must satisfy two properties:

  • be fast to compute;
  • minimize the number of collisions.

Let's say, for definiteness, that there are N keys and the hash function h takes no more than m different values:

0 <= h(K) < m.

An example of a "bad" hash function is one that maps a ten-digit natural number K to three digits selected from the middle of the twenty-digit square K^2. It would seem that hash-code values should be evenly distributed between "000" and "999", but for real data this method works only if the keys do not have many zeros on the left or the right.

However, there are several other simple and reliable methods that many hash functions are based on.

Division-based hash functions

The first method is to use as the hash the remainder of division by m, where m is the number of all possible hash values:

h(K) = K mod m.

At the same time, it is obvious that if m is even, the hash of an even key is even and the hash of an odd key is odd, which can noticeably skew the data in files. Nor should the base of the computer's number system be used as m, since the hash would then depend only on a few low-order digits of the key, leading to many collisions. In practice a prime m is usually chosen; in most cases this choice is quite satisfactory.

Another method worth mentioning is hashing based on division modulo two, i.e., modulo a polynomial over GF(2). In this method m must be a power of two, m = 2^n, and binary keys are treated as polynomials K(x). The hash code is taken to be the coefficient vector of the polynomial obtained as the remainder of dividing K(x) by a pre-selected polynomial P(x) of degree n:

h(K) = K(x) mod P(x).

With the right choice of P(x), this method guarantees the absence of collisions between almost identical keys.

Multiplicative hashing scheme

The second method consists in choosing an integer constant A coprime with w, where w is the number of representable machine-word values (on 32-bit IBM PC computers, w = 2^32). Then a hash function of the form

h(K) = floor(m * ((A * K / w) mod 1))

can be taken. In this case, on a computer with a binary number system, m is a power of two, and h(K) consists of the most significant bits of the lower half of the product A * K.

Among the advantages of these two methods is that they exploit the fact that real keys are not random. For example, if the keys form an arithmetic progression (say, the sequence of names "name1", "name2", "name3"), the multiplicative method maps it to an approximate arithmetic progression of distinct hash values, reducing the number of collisions compared with the random situation.

One variation of this method is Fibonacci hashing, based on properties of the golden ratio. Here A is chosen as the integer closest to 0.6180339887 * w (the golden-ratio conjugate times the word size) that is coprime with w.

Hashing Variable Length Strings

The above methods also apply when the keys consist of several words or have variable length. For example, the words can be combined into one using modular addition or modulo-2 addition. One algorithm working on this principle is the Pearson hash function.

Pearson hashing is an algorithm proposed by Peter Pearson for processors with 8-bit registers, whose task is to quickly compute a hash code for a string of arbitrary length. The function receives a word W consisting of characters, each 1 byte in size, and returns a value in the range 0 to 255. The hash code depends on every character of the input word.

The algorithm can be described by the following pseudocode, which takes a string W as input and uses a permutation table T (a permutation of the values 0 ... 255):

h := 0
for each c in W loop
    index := h xor c
    h := T[index]
end loop
return h

Among the advantages of the algorithm should be noted:

  • ease of calculation;
  • there is no input data for which the probability of collision is the highest;
  • the possibility of modification into an ideal hash function.

As an alternative way of hashing keys consisting of several symbols, one can propose similar modular computations over the character codes.

Using hash functions

Hash functions are widely used in cryptography as well as in many data structures such as hash tables, Bloom filters, and Cartesian trees.

Cryptographic hash functions

Among the many existing hash functions, it is customary to distinguish cryptographically strong ones used in cryptography, since additional requirements are imposed on them. For a hash function to be considered cryptographically strong, it must meet three basic requirements on which most uses of hash functions in cryptography are based:

  • Irreversibility: for a given hash value m, it must be computationally infeasible to find a data block X for which h(X) = m.
  • Resistance to collisions of the first kind: for a given message M, it must be computationally infeasible to find another message N for which h(N) = h(M).
  • Resistance to collisions of the second kind: it must be computationally infeasible to find any pair of messages (M, M') having the same hash.

These requirements depend on each other:

  • A reversible function is resistant to neither first- nor second-kind collisions.
  • A function that is not resistant to first-kind collisions is not resistant to second-kind collisions either; the converse is not true.

It should be noted that the existence of irreversible hash functions — ones for which computing any preimage of a given hash value is theoretically impossible — has not been proven. Usually, finding a preimage is merely computationally hard.

The birthday attack allows collisions to be found for a hash function with n-bit values in roughly 2^(n/2) hash computations on average. Therefore, an n-bit hash function is considered cryptographically strong if the computational complexity of finding collisions for it is close to 2^(n/2).

For cryptographic hash functions it is also important that the slightest change of the argument changes the value of the function greatly (the avalanche effect). In particular, the hash value should not leak information even about individual bits of the argument. This requirement underpins the cryptographic strength of hashing algorithms that hash a user's password to obtain a key.

Hashing is often used in digital signature algorithms, where it is not the message itself that is signed but its hash, which reduces computation time and also increases cryptographic strength. Also, in most cases hash codes are stored instead of the passwords themselves.

Geometric hashing

Geometric hashing (eng. geometric hashing) is a method widely used in computer graphics and computational geometry for solving problems in the plane or in three-dimensional space, for example finding nearest pairs in a point set or searching for identical images. The hash function in this method typically takes a metric space as input and partitions it, creating a grid of cells. The table in this case is an array with two or more indices and is called a grid file (eng. grid file). Geometric hashing is also used in telecommunications when working with multidimensional signals.

Speeding up data retrieval

A hash table is a data structure that stores pairs of the form (key, value) and supports search, insert, and delete operations. The task of hash tables is to speed up search: for example, for records with text fields in a database, the hash code of the text can be computed and the data placed in the section corresponding to that hash code. Then, when searching, it is enough to compute the hash of the text to know immediately which section to look in: not the whole database, but only one section of it, which greatly speeds up the search.

The everyday analogue of hashing in this case can be the placement of words in the dictionary alphabetically. The first letter of a word is its hash code, and when searching, we do not look through the entire dictionary, but only the required letter.
