The semantic way of measuring information: essence, basic concepts and properties

What is information? What is it based on? What are its goals and objectives? All of this is discussed in this article.

General information

When is the semantic method of measuring information used? It is appropriate when the essence of the information, the content side of the received message, is of interest. But first, let us explain what it is. The semantic way of measuring information is a difficult-to-formalize approach that has not yet fully taken shape. It is used to measure the amount of meaning in the data received: in other words, how much of the received information is actually needed in a given case. This approach serves to determine the content side of the information received. When speaking of the semantic way of measuring information, the concept of a thesaurus is used, which is inextricably linked with the topic under consideration. What is it?

Thesaurus

First, a small introduction answering one question about the semantic way of measuring information: who introduced it? The founder of cybernetics, Norbert Wiener, proposed using this method, but it was significantly developed by our compatriot Yu. A. Schreider. A thesaurus is the name given to the totality of information available to the recipient. If we correlate the thesaurus with the content of an arriving message, we can find out how much it reduced uncertainty. One common mistake should be corrected here: many believe that the semantic way of measuring information was introduced by Claude Shannon. It is not known exactly how this misconception arose, but it is incorrect. Claude Shannon introduced the statistical method of measuring information, of which the semantic method is considered the "heir".

Graphic approach for determining the amount of semantic information in a received message

Why draw anything? The semantic method of measurement uses graphs to present data on the usefulness of information in an easily understood visual form. In practice, a dependence is plotted: if the user's knowledge about the essence of the received message is zero, then the amount of semantic information is also zero. Is there an optimal value? Yes: it is the thesaurus value at which the amount of semantic information reaches its maximum. Consider a small example. Suppose a user receives a message written in an unfamiliar foreign language, or can read what is written but finds nothing new in it, since everything is already known to him. In such cases, the message is said to contain zero semantic information.

Historical development

This should probably have been discussed earlier, but it is not too late to catch up. The semantic way of measuring information was originally introduced by Ralph Hartley in 1928. It was mentioned above that Claude Shannon is often named as the founder. Why did such confusion arise? Although the semantic method of measuring information was introduced by Ralph Hartley in 1928, it was Claude Shannon and Warren Weaver who generalized it in 1948. After that, the founder of cybernetics, Norbert Wiener, formed the idea of the thesaurus method, which received the most recognition in the form of the measure developed by Yu. A. Schreider. Note that a sufficiently high level of knowledge is required to understand all of this.

Effectiveness

What does the thesaurus method give us in practice? It is a real confirmation of the thesis that information has the property of relativity: it has a relative (or subjective) value. To evaluate scientific information objectively, the concept of a universal thesaurus was introduced. Its degree of change shows the significance of the knowledge that mankind receives. However, it is impossible to say exactly what final (or intermediate) result can be obtained from information. Take computers, for example. Computing technology was created on the basis of vacuum-tube technology and the bit state of each structural element, and was originally used for calculations. Now almost every person has something that works on the basis of this technology: a radio, a telephone, a computer, a TV, a laptop. Even modern refrigerators, stoves and washbasins contain a bit of electronics whose work rests on information about making these household appliances easier for a person to use.

Scientific approach

Where is the semantic way of measuring information studied? Computer science is the discipline that deals with various aspects of this issue. What is its feature? The method is based on the use of a "true/false" system, or the bit system "one/zero". When certain information arrives, it is divided into separate blocks named like units of speech: words, syllables and the like. Each block receives a specific value. Consider a small example: two friends are standing nearby. One says to the other: "Tomorrow is our day off." Everyone knows which days are days off, so the value of this information is zero. But if the second replies that he works tomorrow, it will be a surprise to the first: plans he had made, such as going bowling or tinkering in the workshop, may be upset. Each part of the described example can be described using ones and zeros.
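As a toy illustration of the "one/zero" reasoning above, here is a short Python sketch; the probabilities assigned to the two messages are invented for the example, and the -log2(p) relation it uses is the same one formalized later in this text.

```python
import math

def surprise_bits(probability: float) -> float:
    """Information carried by a message, in bits: the less likely
    the message, the more informative it is (I = -log2 p)."""
    return -math.log2(probability)

# "Tomorrow is our day off" - the listener fully expects this.
print(surprise_bits(1.0))   # 0.0 bits: no news, zero value
# "I work tomorrow" - an unlikely, surprising message (assumed p = 0.25).
print(surprise_bits(0.25))  # 2.0 bits: enough news to upset plans
```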

Operating with concepts

But what else is used besides the thesaurus? What else do you need to know to understand the semantic way of measuring information? The basic concepts worth exploring further are sign systems: means of expressing meaning, such as rules for interpreting signs or their combinations. Consider another example from computer science. Computers operate with conditional zeros and ones, which in fact are low and high voltages applied to the components of the equipment, and they transmit these ones and zeros in endless streams. How does the equipment distinguish between them? The answer was found: interrupts. When information is transmitted this way, distinct blocks are obtained, such as words, phrases and individual meanings. In oral human speech, pauses are likewise used to break data into separate blocks; they are so inconspicuous that we register most of them only automatically. In writing, dots and commas serve this purpose.

Features

Let us also touch on the properties of the semantic way of measuring information. We already know that this is the name of a special approach that evaluates the importance of information. Can it be said that data evaluated in this way will be objective? No, it cannot: information is subjective. Consider a school as an example. There is an excellent student who is ahead of the approved program, and an average student who studies what is presented in the classroom. For the first, most of the information he receives at school will be of rather weak interest, since he already knows it and is not hearing or reading it for the first time; at the subjective level it will not be very valuable to him (except perhaps for individual remarks made by the teacher during the presentation of the subject). The average student, by contrast, has heard about the new information only remotely, so for him the value of the data presented in the lessons is an order of magnitude greater.

Conclusion

It should be noted that in computer science the semantic way of measuring information is not the only option for solving existing problems. The choice should depend on the goals set and the opportunities available. Therefore, if the topic is of interest or there is a need for it, we can only strongly recommend studying it in more detail and finding out what other ways of measuring information exist besides the semantic one.

To measure information, two parameters are introduced: the amount of information I and the volume of data V_d.

These parameters have different expressions and interpretations depending on the form of adequacy under consideration.

Syntactic adequacy. It reflects the formal and structural characteristics of information and does not touch upon its semantic content. At the syntactic level, the type of medium and the way information is presented, the speed of its transmission and processing, the sizes of the codes used to present it, the reliability and accuracy of converting those codes, and so on are taken into account.

Information considered only from a syntactic point of view is usually called data, since the semantic side does not matter.

Semantic (meaning-related) adequacy. This form determines the degree of correspondence between the image of an object and the object itself. The semantic aspect involves consideration of the semantic content of information. At this level, the data that the information reflects is analyzed and semantic relationships are considered. In computer science, semantic connections are established between the codes used to present information. This form serves to form concepts and notions, to identify the meaning and content of information, and to generalize it.

Pragmatic (consumer) adequacy. It reflects the relationship between information and its consumer, the correspondence of information to the management goal that is realized on its basis. The pragmatic properties of information are manifested only when information (the object), the user and the management goal form a unity.

The pragmatic aspect of consideration is associated with the value and usefulness of information for the consumer's decision-making in pursuit of his goals. From this point of view, the consumer properties of information are analyzed. This form of adequacy is directly related to the practical use of information and to its correspondence with the target function of the system.

Each form of adequacy has its own measure of the amount of information and the amount of data (Fig. 2.1).

Fig. 2.1. Information Measures

2.2.1. Syntactic measure of information

The syntactic measure of the amount of information operates with impersonal information that does not express a semantic relationship to the object.

The volume of data V_d in a message is measured by the number of characters (digits) in that message. In different number systems one digit has a different weight, and the unit of data measurement changes accordingly:

  • in the binary number system, the unit of measurement is the bit (binary digit);
  • in the decimal number system, the unit of measurement is the dit (decimal digit).

Example. A message in the binary system in the form of the eight-bit binary code 10111011 has a data volume V_d = 8 bits.

A message in the decimal system in the form of the six-digit number 275903 has a data volume V_d = 6 dits.
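A minimal sketch of the syntactic volume measure in Python: V_d is simply a character count, so the function below just measures message length (the function name is ours, not from the text).

```python
def data_volume(message: str) -> int:
    """Syntactic data volume V_d: the number of digits (characters)
    in the message, regardless of its meaning."""
    return len(message)

print(data_volume("10111011"))  # 8  -> V_d = 8 bits for a binary message
print(data_volume("275903"))    # 6  -> V_d = 6 dits for a decimal message
```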

The amount of information is determined by the formula:

$I_\beta(\alpha) = H(\alpha) - H_\beta(\alpha)$,

where $H(\alpha)$ is the entropy of the system before the message is received and $H_\beta(\alpha)$ is its entropy after message $\beta$ is received; that is, the amount of information is measured by the change (decrease) in the uncertainty of the system's state.

The entropy of a system $H(\alpha)$ having N possible states is, according to Shannon's formula,

$H(\alpha) = -\sum_{i=1}^{N} p_i \log_2 p_i$,

where $p_i$ is the probability that the system is in the i-th state.

For the case when all states of the system are equally probable, its entropy is determined by the relation

$H(\alpha) = \log_2 N$.

Since a message of n digits in a number system with base m can display $N = m^n$ different states, the amount of information in such a message is

$I = \log_2 N = n \log_2 m$,

where N is the number of all possible displayed states; m is the base of the number system (the variety of characters used in the alphabet); n is the number of digits (characters) in the message.
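The two formulas above can be checked with a small Python sketch (the function names and the sample distribution are ours):

```python
import math

def shannon_entropy(probabilities: list[float]) -> float:
    """H(alpha) = -sum over i of p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def hartley_information(m: int, n: int) -> float:
    """Equiprobable case: I = log2(N) = n * log2(m), with N = m**n
    possible messages of n digits in a base-m number system."""
    return n * math.log2(m)

print(shannon_entropy([0.5, 0.25, 0.25]))  # 1.5 bits
print(shannon_entropy([0.25] * 4))         # 2.0 bits = log2(4)
print(hartley_information(m=2, n=8))       # 8.0 bits, as in the 8-bit code above
```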

2.2.2. Semantic measure of information

To measure the semantic content of information, i.e. its quantity at the semantic level, the thesaurus measure is the most widely recognized; it connects the semantic properties of information with the user's ability to receive an incoming message. For this, the concept of a user thesaurus is used.

A thesaurus is a collection of information that a user or system has.

Depending on the ratio between the semantic content of the information S and the user's thesaurus S_p, the amount of semantic information I_c perceived by the user and subsequently included in his thesaurus changes. The nature of this dependence is shown in Fig. 2.2:

  • when S_p = 0, the user does not perceive or understand the incoming information;
  • as S_p → ∞, the user already knows everything and does not need the incoming information.

Fig. 2.2. Dependence of the amount of semantic information perceived by the consumer on his thesaurus, I_c = f(S_p)

When assessing the semantic (content) aspect of information, one should strive to match the values of S and S_p.

A relative measure of the amount of semantic information is the content factor C, defined as the ratio of the amount of semantic information to the volume of data:

$C = I_c / V_d$.
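The text gives no closed-form expression for I_c = f(S_p), so the sketch below only mimics the qualitative shape of Fig. 2.2 with an invented piecewise-linear curve (zero at S_p = 0, a maximum at the optimal thesaurus, falling back toward zero) and then applies the content factor C = I_c / V_d; all parameter values are assumptions chosen for illustration.

```python
def semantic_information(s_p: float, s_p_opt: float = 1.0,
                         i_max: float = 10.0) -> float:
    """Toy model of I_c = f(S_p): zero for an empty thesaurus,
    maximal at S_p = s_p_opt, declining again as the user
    already knows everything (S_p -> infinity)."""
    if s_p <= 0:
        return 0.0
    if s_p <= s_p_opt:
        return i_max * s_p / s_p_opt                  # rising branch
    return max(0.0, i_max * (2.0 - s_p / s_p_opt))    # falling branch

def content_factor(i_c: float, v_d: float) -> float:
    """Content factor C = I_c / V_d."""
    return i_c / v_d

i_c = semantic_information(s_p=0.5)      # halfway to the optimal thesaurus
print(i_c)                               # 5.0
print(content_factor(i_c, v_d=40.0))     # 0.125: semantic yield per data unit
```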

2.2.3. Pragmatic measure of information

This measure determines the usefulness (value) of information for achieving the goal set by the user. It is also a relative quantity, conditioned by the particular way information is used in a given system. It is advisable to measure the value of information in the same units (or close to them) in which the objective function is measured.

For comparison, the introduced measures of information are presented in Table 2.1.

Table 2.1. Measures of information

Measure of information | Units of measurement | Examples (for the computer field)
Syntactic: Shannon approach | Degree of reduction in uncertainty | Probability of an event
Syntactic: computer approach | Units of information presentation | Bit, byte, KB, etc.
Semantic | Thesaurus | Application software package, personal computer, computer networks, etc.
Semantic | Economic indicators | Profitability, productivity, depreciation rate, etc.
Pragmatic | Value of use | Memory capacity, computer performance, data transfer rate, monetary expression, information processing and decision-making time, etc.

The average amount of information per state is called the entropy of a discrete source of information:

$H = -\sum_{i=1}^{N} p_i \log p_i$.   (4)

If we again focus on measuring uncertainty in binary units, the base of the logarithm should be taken equal to two:

$H = -\sum_{i=1}^{N} p_i \log_2 p_i$.   (5)

With an equiprobable choice, all $p_i = 1/N$, and formula (5) is converted into R. Hartley's formula (2):

$H = -\sum_{i=1}^{N} \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N$.

The proposed measure was called entropy not by chance. The formal structure of expression (4) coincides with that of the entropy of a physical system as defined earlier by Boltzmann. According to the second law of thermodynamics, the entropy of a closed space is determined by the expression

$H = -\sum_{i=1}^{N} p_i \ln p_i$,

where $p_i$ is the probability of the i-th microstate (with $\sum_{i=1}^{N} p_i = 1$). This formula completely coincides with (4) up to the base of the logarithm.

In both cases, the value of H characterizes the degree of diversity of the system.

Using formulas (3) and (5), we can determine the redundancy R of the source alphabet, which shows how rationally the characters of this alphabet are used:

$R = 1 - \frac{H}{H_{\max}}$,

where $H_{\max}$ is the maximum possible entropy, determined by formula (3), and H is the entropy of the source, determined by formula (5).

The essence of this measure is that with an equiprobable choice, the same information load per sign can be provided using an alphabet of smaller size than in the case of a non-equiprobable choice.
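A short Python check of the redundancy formula (the function name and the sample distributions are ours):

```python
import math

def redundancy(probabilities: list[float]) -> float:
    """R = 1 - H / H_max for an m-character alphabet, where
    H_max = log2(m) is the maximum (equiprobable) entropy from
    formula (3) and H is the actual source entropy from formula (5)."""
    m = len(probabilities)
    h = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 1.0 - h / math.log2(m)

print(redundancy([0.25, 0.25, 0.25, 0.25]))  # 0.0   - alphabet fully used
print(redundancy([0.7, 0.1, 0.1, 0.1]))      # ~0.32 - a third of capacity wasted
```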

Semantic Level Information Measures

To measure the semantic content of information, i.e. its quantity at the semantic level, the most widely used is the thesaurus measure, which connects the semantic properties of information with the user's ability to receive an incoming message. Indeed, to understand and use the information received, the recipient must have a certain supply of knowledge. Complete ignorance of the subject does not allow us to extract useful information from the received message about this subject. As knowledge of the subject grows, so does the amount of useful information extracted from the message.

If we call the recipient's knowledge of a given subject a "thesaurus" (that is, a set of words, concepts and names of objects connected by semantic links), then the amount of information contained in a message can be estimated by the degree to which that message changes the individual thesaurus.

A thesaurus is a collection of information held by a user or system.

In other words, the amount of semantic information extracted by the recipient from incoming messages depends on the degree of preparedness of his thesaurus for the perception of such information.

Depending on the relationship between the semantic content of the information S and the user's thesaurus S_p, the amount of semantic information I_c perceived by the user and subsequently included in his thesaurus changes. The nature of this relationship is shown in Figure 3. Consider two limiting cases in which the amount of semantic information I_c equals 0: when S_p = 0, the user does not perceive or understand the incoming information; as S_p → ∞, the user already knows everything and has no need of the incoming information.

Figure 3 - Dependence of the amount of semantic information perceived by the consumer on his thesaurus, I_c = f(S_p)

The consumer acquires the maximum amount of semantic information when its semantic content S is coordinated with his thesaurus S_p (at S_p close to the optimum), that is, when the incoming information is understandable to the user and brings him previously unknown information (absent from his thesaurus).

Therefore, the amount of semantic information in a message, i.e. the amount of new knowledge acquired by the user, is a relative quantity. One and the same message can have semantic content for a competent user and be meaningless for an incompetent one.

When assessing the semantic (content) aspect of information, it is necessary to strive to match the values of S and S_p.

A relative measure of the amount of semantic information is the content factor C, defined as the ratio of the amount of semantic information to the volume of data:

$C = I_c / V_d$.

Another approach to the semantic evaluation of information, developed within scientometrics, takes as the main indicator of the semantic value of the information contained in an analyzed document (message, publication) the number of references to it in other documents. Specific indicators are formed by statistical processing of the number of references in various samples.

Pragmatic Information Measures

This measure determines the usefulness of information (value) to achieve the goal set by the user. It is also a relative value, due to the features of the use of this information in a particular system.

One of the first Russian scientists to address this problem was A. A. Kharkevich, who proposed taking as a measure of the value of information the amount of information necessary to achieve the goal, i.e. calculating the increment in the probability of achieving the goal. So, if $p_0$ is the probability of achieving the goal before the information is received and $p_1$ after it is received, then the value of the information is

$I = \log_2 p_1 - \log_2 p_0 = \log_2 \frac{p_1}{p_0}$.   (7)

Thus, the value of information is measured in units of information, in this case, in bits.

Expression (7) can be considered the result of normalizing the number of outcomes. As an explanation, Figure 4 shows three schemes with the same numbers of outcomes, 2 and 6, at points 0 and 1 respectively. The starting position is point 0; on the basis of the received information, a transition is made to point 1. The goal is marked with a cross, and favorable outcomes are depicted as lines leading to the goal. Let us determine the value of the information obtained in all three cases:

a) the number of favorable outcomes is three: $p_1 = 3/6 = 1/2$, and therefore $I = \log_2\frac{1/2}{1/2} = 0$;

b) there is one favorable outcome: $p_1 = 1/6$, so $I = \log_2\frac{1/6}{1/2} = \log_2\frac{1}{3} \approx -1.58$;

c) the number of favorable outcomes is four: $p_1 = 4/6 = 2/3$, so $I = \log_2\frac{2/3}{1/2} = \log_2\frac{4}{3} \approx 0.42$.

(Here $p_0 = 1/2$ in each scheme, since one of the two outcomes at point 0 leads to the goal.) In case b) a negative value of information (negative information) is obtained. Such information, which increases the initial uncertainty and reduces the probability of achieving the goal, is called disinformation. Thus, in case b) we received disinformation of 1.58 binary units.
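The three cases can be reproduced with a few lines of Python, under the same reading of Figure 4 as above (one of the two outcomes at point 0 is favorable, so p_0 = 1/2 is an assumption read off the figure):

```python
import math

def information_value(p0: float, p1: float) -> float:
    """Kharkevich measure: I = log2(p1/p0), the value of information
    as the change in the probability of reaching the goal."""
    return math.log2(p1 / p0)

p0 = 1 / 2                            # one of two outcomes at point 0 is favorable
print(information_value(p0, 3 / 6))   #  0.0   case a): the message changes nothing
print(information_value(p0, 1 / 6))   # -1.58  case b): disinformation
print(information_value(p0, 4 / 6))   #  0.42  case c): useful information
```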

INFORMATION TRANSFER LEVELS

When information processes are implemented, information is always transferred in space and time from the source of information to the receiver (recipient). Various signs or symbols are used to transmit it, for example those of a natural or artificial (formal) language, which allow it to be expressed in a form called a message.

A message is a form of representing information as a set of signs (symbols) used for transmission.

A message, as a set of signs, can be studied from the standpoint of semiotics (from the Greek semeion, "sign"), the science of the properties of signs and sign systems, at three levels:

1) syntactic, where the internal properties of messages are considered, i.e. the relations between signs reflecting the structure of the given sign system (external properties are studied at the semantic and pragmatic levels);

2) semantic, where the relations between the signs and the objects, actions and qualities they designate are analyzed, i.e. the semantic content of the message and its relation to the source of information;

3) pragmatic, where the relations between the message and the recipient are considered, i.e. the consumer content of the message and its relation to the recipient.

Thus, taking into account the definite relationship between the problems of information transfer and the levels at which sign systems are studied, these problems are divided into three levels: syntactic, semantic and pragmatic.

Problems of the syntactic level concern the creation of theoretical foundations for building information systems whose main performance indicators would be close to the maximum possible, as well as the improvement of existing systems to increase the efficiency of their use. These are purely technical problems of improving the methods of transmitting messages and their material carriers, the signals. At this level, the problems of delivering messages to the recipient as a set of characters are considered, taking into account the type of medium and method of presenting information, the speed of transmission and processing, the sizes of the codes used to present information, the reliability and accuracy of converting those codes, etc., abstracting completely from the semantic content of the messages and their intended purpose. At this level, information considered only from the syntactic point of view is usually called data, since its semantic side does not matter.

The modern theory of information mainly investigates problems of this particular level. It is based on the concept of "amount of information", which is a measure of the frequency of use of signs and in no way reflects the meaning or importance of the transmitted messages. In this regard, it is sometimes said that the modern theory of information is at the syntactic level.

Problems of the semantic level are associated with formalizing and taking into account the meaning of the transmitted information and determining the degree of correspondence between the image of an object and the object itself. At this level, the data that the information reflects is analyzed, semantic connections are considered, concepts and ideas are formed, and the meaning and content of the information are revealed and generalized.

The problems of this level are extremely complex, since the semantic content of the information depends more on the recipient than on the semantics of the message presented in any language.

At the pragmatic level, the interest is in the consequences of the consumer's receipt and use of the information. The problems of this level are associated with determining the value and usefulness of information for the consumer's decision-making in pursuit of his goals. The main difficulty is that the value and usefulness of information can be completely different for different recipients and, moreover, depend on a number of factors, such as the timeliness of its delivery and use. High requirements for the speed of information delivery are often dictated by the fact that control actions must be carried out in real time, that is, at the rate of change of the state of the controlled objects or processes. Delays in the delivery or use of information can be disastrous.

As already noted, the concept of information can be considered under various restrictions imposed on its properties, i.e. at various levels of consideration. Basically, there are three levels - syntactic, semantic and pragmatic. Accordingly, on each of them, various estimates are used to determine the amount of information.

At the syntactic level, probabilistic methods are used to estimate the amount of information; they take into account only the probabilistic properties of information and ignore others (semantic content, usefulness, relevance, etc.). Mathematical and, in particular, probabilistic methods developed in the middle of the twentieth century made it possible to formulate an approach to assessing the amount of information as a measure of the reduction in the uncertainty of knowledge.

This approach, also called probabilistic, postulates the principle: if a message leads to a decrease in the uncertainty of our knowledge, then it can be argued that such a message contains information. In this case, the messages contain information about any events that may occur with different probabilities.

The formula for determining the amount of information for events with different probabilities obtained from a discrete source of information was proposed by the American scientist C. Shannon in 1948. According to this formula, the amount of information is

$I = -\sum_{i=1}^{N} p_i \log_2 p_i$,   (2.1)

where I is the amount of information; N is the number of possible events (messages); $p_i$ is the probability of individual events (messages).

The amount of information determined using formula (2.1) takes only positive values. Since the probability of individual events is less than one, the expression $\log_2 p_i$ is negative, and to obtain a positive value for the amount of information, formula (2.1) has a minus sign in front of the summation sign.

If the individual events are equally probable and form a complete group of events, i.e.

$p_1 = p_2 = \dots = p_N = \frac{1}{N}, \qquad \sum_{i=1}^{N} p_i = 1$,

then formula (2.1) is transformed into R. Hartley's formula:

$I = \log_2 N$.   (2.2)

In formulas (2.1) and (2.2), the relationship between the amount of information I and, accordingly, the probability (or number) of individual events is expressed using a logarithm.

The use of logarithms in formulas (2.1) and (2.2) can be explained as follows. For simplicity, consider relation (2.2). Let us successively assign the argument N values chosen, for example, from the series 1, 2, 4, 8, 16, 32, 64, and so on. To determine which of N equally probable events has occurred, for each number in the series it is necessary to perform successive operations of choosing between two possible events.

Thus, for N = 1 the number of operations is 0 (the probability of the event is 1); for N = 2 it is 1; for N = 4 it is 2; for N = 8 it is 3, and so on. We obtain the series 0, 1, 2, 3, 4, 5, 6, ..., whose members can be regarded as the corresponding values of the function I in relation (2.2).

The sequence of values taken by the argument N is known in mathematics as a geometric progression, while the sequence of values taken by the function I forms an arithmetic progression. Thus, the logarithm in formulas (2.1) and (2.2) establishes the well-known correspondence between a geometric and an arithmetic progression.
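The choice-counting argument can be expressed directly in Python; the sketch below simply counts how many halvings reduce N candidates to one, which equals log2(N) for the powers of two in the series:

```python
def choices_needed(n: int) -> int:
    """Number of binary (yes/no) choices needed to single out one
    of n equally probable events: count how many times n halves."""
    count = 0
    while n > 1:
        n //= 2      # each choice halves the remaining candidates
        count += 1
    return count

for n in [1, 2, 4, 8, 16, 32, 64]:    # geometric progression
    print(n, choices_needed(n))       # prints 0, 1, 2, 3, ... (arithmetic)
```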

For the quantitative determination (estimation) of any physical quantity, it is necessary to define a unit of measurement, which in measurement theory is called a measure.


As already noted, information must be encoded before processing, transferring and storing.

Coding is performed using special alphabets (sign systems). Computer science, which studies the processes of obtaining, processing, transmitting and storing information using computing (computer) systems, mainly uses binary coding, based on a sign system of two characters, 0 and 1. For this reason, the number 2 is used as the base of the logarithm in formulas (2.1) and (2.2).

Based on the probabilistic approach to determining the amount of information, the two characters of the binary sign system can be considered as two different possible events. Therefore, the unit of information is taken to be the amount of information contained in a message that halves the uncertainty of knowledge (before the event is received, the probability of each is 0.5; after it is received, it is 1; the uncertainty accordingly decreases by a factor of 1/0.5 = 2). This unit of information is called a bit (from the English binary digit). Thus, one bit is adopted as the measure for estimating the amount of information at the syntactic level under binary coding.

The next largest unit of measurement of the amount of information is the byte, a sequence of eight bits:

1 byte = 2^3 bits = 8 bits.

In informatics, byte-multiple units of the amount of information are also widely used; however, unlike the metric system, where the multipliers of multiple units are 10^n with n = 3, 6, 9, etc., multiples of the amount of information use the coefficient 2^n. This choice is explained by the fact that the computer operates with numbers in binary rather than decimal notation.

Byte-multiple units of the amount of information are introduced as follows:

1 kilobyte (KB) = 2^10 bytes = 1024 bytes;

1 megabyte (MB) = 2^10 KB = 1024 KB;

1 gigabyte (GB) = 2^10 MB = 1024 MB;

1 terabyte (TB) = 2^10 GB = 1024 GB;

1 petabyte (PB) = 2^10 TB = 1024 TB;

1 exabyte (EB) = 2^10 PB = 1024 PB.
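A small Python helper makes the 2^10 ladder concrete (the function and unit list are ours, a sketch rather than a standard library facility):

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]

def humanize(n_bytes: int) -> str:
    """Express a byte count using binary multiples (factor 2**10)."""
    value, unit = float(n_bytes), 0
    while value >= 1024 and unit < len(UNITS) - 1:
        value /= 1024            # 1 KB = 2**10 bytes, 1 MB = 2**10 KB, ...
        unit += 1
    return f"{value:g} {UNITS[unit]}"

print(humanize(1024))            # "1 KB"
print(humanize(5 * 2**20))       # "5 MB"
print(humanize(3 * 2**30))       # "3 GB"
```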

The units of measurement whose names carry the prefixes "kilo", "mega", etc. are not correct from the point of view of measurement theory, since these prefixes belong to the metric system of measures, in which the multipliers of multiple units are 10^n with n = 3, 6, 9, etc. To eliminate this incorrectness, the International Electrotechnical Commission, which creates standards for the electronic technology industry, has approved a number of new prefixes for units of measuring the amount of information: kibi, mebi, gibi, tebi, pebi, exbi. However, the old designations are still in use, and it will take time for the new names to become widespread.

A probabilistic approach is also used to determine the amount of information presented by means of sign systems. If we consider the characters of the alphabet as a set of N possible messages, then the amount of information carried by one character of the alphabet can be determined by formula (2.1). If each character of the alphabet appears in the message text with equal probability, formula (2.2) can be used to determine the amount of information.

The more characters an alphabet contains, the more information one character of that alphabet carries. The number of characters in an alphabet is called the power of the alphabet. The amount of information (information volume) contained in a message encoded with a sign system and containing a certain number of characters is determined by the formula

$V = K \cdot I$,

where V is the information volume of the message; $I = \log_2 N$ is the information volume of one character (sign); K is the number of characters (symbols) in the message; N is the power of the alphabet (the number of characters in the alphabet).
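A closing sketch applying V = K · I (the message text and alphabet power are arbitrary examples):

```python
import math

def message_volume(message: str, alphabet_power: int) -> float:
    """V = K * I: K characters times I = log2(N) bits per character
    of an alphabet with N equally probable characters."""
    return len(message) * math.log2(alphabet_power)

# A 10-character message over a 32-character alphabet:
print(message_volume("HELLOWORLD", alphabet_power=32))   # 50.0 bits
```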
