What is cache memory and what is it for? The impact of cache capacity on processor performance

Good day to all. Today we will try to explain such a thing as the cache. The processor cache is an ultra-fast block of memory that handles data roughly 16-17 times faster than standard DDR4 RAM.

From this article you will learn what a processor cache is, what levels it is divided into, and how to find out its size on your own CPU.

It is the cache that allows the CPU to run at full speed without waiting for RAM to deliver data or to accept the results of finished calculations for further processing. A similar principle can be seen in an HDD, which uses a buffer of 8-128 MB; the speeds there are much lower, but the process is similar.

What is a processor cache?

How does the calculation process work in general? All data is stored in RAM, which is designed for the temporary storage of important user and system information. The processor pulls a certain batch of tasks into the ultra-fast block called cache memory and gets on with its direct duties.

The calculation results are again sent to RAM, but in a much smaller amount (instead of a thousand output values, we get much less), and a new array is taken for processing. And so on until the job is done.

The speed of this exchange is limited by the performance of the RAM. Yet not a single modern DDR4 module, including overclocking kits running at close to 4000 MHz, comes anywhere near the capabilities of even the most modest processor with its "slow" cache.

This is because the CPU is, on average, about 15 times faster than RAM, and often more. And do not look only at the frequency figures - there are plenty of other differences.
In theory it might seem that even the super-powerful Intel Xeon and AMD Epyc should sit idle, yet in practice both server chips work at their limit. That is because they gather the required amount of data into the cache (up to 60 MB or more) and process it instantly. RAM serves as a kind of warehouse from which arrays are scooped up for calculation. The computing efficiency of the machine grows and everyone is happy.

A brief excursion into history

The first mention of cache memory dates back to the end of the 1980s. Until then, processor and memory speeds were roughly the same. The rapid development of chips required some kind of "crutch" to raise the effective speed of RAM, but building all of it from ultra-fast chips was very expensive, so a more economical option was chosen: adding a small high-speed memory array to the CPU.

Cache memory first appeared with the Intel 80386. At that time DRAM latencies hovered around 120 nanoseconds, while the SRAM used for the cache brought latency down to an impressive (for its day) 10 nanoseconds. A similar picture can be seen today in the contrast between HDDs and SSDs.

Initially, cache memory was soldered directly onto the motherboard, owing to the process technology of the time. Starting with the Intel 80486, 8 KB of memory was incorporated directly into the processor die, which raised performance further.

This layout remained relevant only until the release of the Pentium MMX, after which SRAM was displaced by the more advanced SDRAM. The processors themselves also became much smaller, so the need for external cache chips disappeared.

Cache levels

The specifications of modern CPUs list, among other things, the sizes of the level 1, 2 and 3 caches. How are they defined and what do they affect? Let's break it down in simple terms.

  • The first level (L1) cache is the most important and fastest memory in the CPU. Each core has its own L1 module, and that module stores only the most frequently used and important data of its own core. Its size is usually limited to 32-64 KB.
  • The second level (L2) cache compensates for its lower speed with a larger buffer, reaching 256 or even 512 KB. It works on the same principle as L1, but is accessed less often, since it stores lower-priority data.
  • The third level (L3) cache is the slowest and largest of the three, yet it is still far faster than RAM. Its size can reach 20 or even 60 MB in server chips. Its benefit is enormous: it is the key link for exchanging data between all the cores of the chip. Without L3, the parts of the chip would be left cut off from one another.

On sale you can find processors with both two- and three-level cache structures. Which is better? If you use the processor only for office programs and casual games, you will not feel any difference. If the system is built for demanding 3D games, archiving, rendering and graphics work, the gain in some cases will be from 5 to 10%.
An L3 cache pays off only if you regularly work with multi-threaded applications that demand complex calculations, which is why server models often carry large L3 caches. There are even cases where this is not enough, and so-called L4 modules are added as a separate chip on the motherboard.

How to find out the number of levels and cache size on your processor?

This can be done in three ways:

  • via command line (only L2 and L3 cache);
  • by searching for specifications on the Internet;
  • using third party utilities.

If we take as a baseline that L1 is 32 KB for most processors, while L2 and L3 can vary widely, we only need the last two values. To find them, open the command line via the Start menu (type "cmd" into the search bar) and run a query there.
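The article does not show the exact query; one command that should work on Windows (assuming the WMIC utility is available on your build) is:

    wmic cpu get L2CacheSize,L3CacheSize

Both values are reported in kilobytes.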

The system will show a suspiciously high value for L2: it is the total for the whole chip, so divide it by the number of processor cores to get the per-core figure.

If you prefer to look the data up online, first find out the exact name of the CPU. Right-click the "My Computer" icon and select "Properties". In the "System" section you will find the "Processor" entry we need. Type its name into Google or Yandex and check the value on the relevant sites; for reliable information, it is best to use the manufacturer's official portal (Intel or AMD).
The third method is also straightforward, but requires installing extra software such as CPU-Z, AIDA64 or other utilities for studying the specifications of your chip. An option for overclocking fans who like to dig into the details.

Results

Now you understand what cache memory is, what its size depends on, and for what purposes an ultra-fast data array is used. At the moment, the most interesting solutions on the market in terms of large amounts of cache memory are AMD Ryzen 5 and 7 devices with their 16 MB L3.

In the following articles we will cover more processor-related topics, so stay with us. Until next time.

All users are well aware of such computer components as the processor, which is responsible for processing data, and the random access memory (RAM), which is responsible for storing it. But probably not everyone knows that the processor also has its own cache (CPU cache), a kind of super-fast RAM of its own.

What is the reason that prompted computer developers to use special memory for the processor? Isn't RAM enough for a computer?

Indeed, for a long time, personal computers did without any kind of cache memory. But, as you know, the processor is the fastest device in a personal computer and its speed has grown with each new generation of CPU. Currently, its speed is measured in billions of operations per second. At the same time, standard RAM has not significantly increased its performance over the course of its evolution.

Generally speaking, there are two main technologies for memory chips - static memory and dynamic memory. Without delving into the details of their structure, we will only say that static memory, unlike dynamic memory, does not require regeneration; in addition, 4-8 transistors are used for one bit of information in static memory, while 1-2 transistors are used in dynamic memory. Accordingly, dynamic memory is much cheaper than static memory, but at the same time much slower. Currently, RAM chips are manufactured on the basis of dynamic memory.

Approximate evolution of the ratio of the speed of processors and RAM:

Thus, if the processor took information from the main memory all the time, then it would have to wait for the slow dynamic memory, and it would be idle all the time. In the same case, if static memory were used as RAM, then the cost of the computer would increase several times.

That is why a reasonable compromise was developed. The main part of the RAM remained dynamic, while the processor got its own fast cache based on static memory chips. Its volume is relatively small - for example, the volume of the L2 cache is only a few megabytes. However, here it is worth remembering that all the RAM of the first IBM PC computers was less than 1 MB.

In addition, the expediency of implementing caching technology is also influenced by the fact that different applications that are in RAM load the processor differently, and, as a result, there is a lot of data that requires priority processing compared to the rest.

History of the cache

Strictly speaking, before the cache memory moved to personal computers, it had been successfully used in supercomputers for several decades.

For the first time, a cache memory of only 16 KB appeared in a PC based on the i80386 processor. Today's processors use various levels of cache, from the first (the fastest cache of the smallest size - usually 128 KB) to the third (the slowest cache of the largest size - up to tens of MB).

At first, the processor's external cache memory was located on a separate chip. Over time, however, this led to the fact that the bus located between the cache and the processor became a bottleneck, slowing down data exchange. In modern microprocessors, both the first and second levels of cache memory are located in the processor core itself.

For a long time, there were only two cache levels in processors, but for the first time in the Intel Itanium CPU, a third-level cache memory appeared, common to all processor cores. There are also developments of processors with a four-level cache.

Architectures and principles of cache operation

To date, two main types of cache memory organization are known, which originate from the first theoretical developments in cybernetics: the Princeton and Harvard architectures. The Princeton architecture implies a single memory space for storing data and instructions, while the Harvard one keeps them separate. Most x86 processors in personal computers use the separate (Harvard) type of cache memory. In addition, a third type of cache has appeared in modern processors - the translation lookaside buffer (TLB), designed to speed up the conversion of the operating system's virtual memory addresses into physical addresses.

Simplified, the interaction between the cache and the processor can be described as follows. First, the processor checks for the data it needs in the fastest cache, the first level, then in the second-level cache, and so on. If the required information is not found at any cache level, this is called a cache miss. If the information is not in the cache at all, the processor has to fetch it from RAM or even from external memory (the hard disk).
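As a purely illustrative sketch (real hardware does this in dedicated circuitry; the dictionaries here are only stand-ins, and the snippet requires using System.Collections.Generic;), the order of the search could be modelled like this:

// Toy model of the lookup order described above; each level is just a dictionary here.
static Dictionary<long, int> l1 = new Dictionary<long, int>();
static Dictionary<long, int> l2 = new Dictionary<long, int>();
static Dictionary<long, int> l3 = new Dictionary<long, int>();
static Dictionary<long, int> ram = new Dictionary<long, int>();

static int Read(long address)
{
    int value;
    if (l1.TryGetValue(address, out value)) return value; // L1 hit: fastest
    if (l2.TryGetValue(address, out value)) return value; // L2 hit: a little slower
    if (l3.TryGetValue(address, out value)) return value; // L3 hit: slower still
    return ram[address];                                  // miss at every level: go to RAM (or even the disk)
}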

The order in which the processor searches for information in memory:

To control the operation of the cache memory and its interaction with the computing units of the processor, as well as the RAM, there is a special controller.

Scheme of organizing the interaction of the processor core, cache and RAM:

The cache controller is the key link between the processor, RAM and cache.

It should be noted that data caching is a complex process that uses many technologies and mathematical algorithms. Among the basic concepts used in caching, one can single out the methods of writing a cache and the architecture of cache memory associativity.

Cache Write Methods

There are two main methods for writing information to the cache:

  1. The write-back method - data is written to the cache first and then, when certain conditions are met, to RAM.
  2. The write-through method - data is written to RAM and to the cache simultaneously.
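To make the difference concrete, here is a toy sketch (hypothetical C#, with dictionaries standing in for the cache and RAM, names invented for illustration; requires using System.Collections.Generic;):

// Write-through: the value goes to the cache and to RAM at the same time.
static Dictionary<long, int> cache = new Dictionary<long, int>();
static Dictionary<long, int> ram   = new Dictionary<long, int>();
static HashSet<long> dirtyEntries  = new HashSet<long>();

static void WriteThrough(long address, int value)
{
    cache[address] = value;
    ram[address]   = value;
}

// Write-back: the value goes only to the cache and the entry is marked "dirty";
// RAM is updated later, when the entry is evicted from the cache.
static void WriteBack(long address, int value)
{
    cache[address] = value;
    dirtyEntries.Add(address);
}

static void Evict(long address)
{
    if (dirtyEntries.Remove(address)) ram[address] = cache[address]; // flush dirty data to RAM
    cache.Remove(address);
}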

Cache Associativity Architecture

The cache associativity architecture defines the way in which data from RAM is mapped to the cache. There are the following main variants of caching associativity architecture:

  1. Direct-mapped cache - a specific area of the cache is responsible for a specific area of RAM.
  2. Fully associative cache - any cache area can be associated with any RAM area.
  3. Mixed (set-associative) cache - a compromise between the first two options.

Different cache levels typically use different associativity architectures. Direct-mapped caching is the fastest option, so this architecture is typically used for large caches; a fully associative cache, in turn, suffers fewer cache misses.
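As a rough sketch of how the direct mapping can be computed (the constants are illustrative assumptions, not taken from any particular CPU), the cache slot for a given memory address follows from its cache-line index:

const int LineSize = 64;    // bytes per cache line (a typical value)
const int NumSlots = 4096;  // assumed number of slots in a direct-mapped cache

static long CacheSlot(long address)
{
    long lineIndex = address / LineSize;  // which line of RAM the address belongs to
    return lineIndex % NumSlots;          // the single slot that line may occupy
}

Two addresses whose line indices differ by a multiple of NumSlots map to the same slot and keep evicting each other - exactly the conflict problem that set-associative and fully associative designs mitigate.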

Conclusion

In this article, you got acquainted with the concept of cache memory, cache memory architecture and caching methods, and learned how it affects the performance of a modern computer. The presence of cache memory can significantly optimize the performance of the processor, reduce its idle time, and, consequently, increase the performance of the entire system.

Today's article is not a standalone piece - it simply continues our study of three generations of the Core architecture on equal terms (started at the end of last year and continued recently). True, today we take a small step to the side: the core and cache frequencies remain the same as before, but the capacity of the cache decreases. Why is this needed? Until now, for purity of experiment, we used "full" Core i7 chips of the last two generations, testing them with Hyper-Threading enabled and disabled, whereas for a year and a half the Core i5 has shipped with 6 rather than 8 MiB of L3. It is clear that the effect of cache capacity on performance is not as large as is sometimes believed, but it exists, and there is no getting away from it. Besides, the Core i5 is a more mass-market product than the Core i7, and in the first generation no one "shortchanged" it in this parameter. Back then it was limited in a different way: the UnCore clock in first-generation i5s was only 2.13 GHz, so our "Nehalem" at 2.4 GHz is not exactly a representative of the 700 line but a slightly faster processor. However, we saw no need to greatly expand the list of participants and rework the test conditions - in any case, as we have warned more than once, this kind of testing brings no new practical information: real processors work in completely different modes. But for those who want to thoroughly understand all the fine points, such testing should, we think, be interesting.

Test bench configuration

We decided to limit ourselves to just four processors, two of which are the main participants: both quad-core Ivy Bridge, but with different L3 cache capacities. The third is "Nehalem HT", which last time scored almost identically to the "plain" Ivy Bridge. The fourth is "plain Nehalem", which, as already mentioned, is slightly faster than a real first-generation Core i5 at 2.4 GHz (because the UnCore frequency in the 700 line was a bit lower), though not radically so. That comparison is interesting too: on one side, two steps of microarchitectural improvement; on the other, reduced cache memory. A priori we can assume that the former will outweigh the latter in most cases, but by how much - and, more generally, how the "first" and "third" generation i5s compare (adjusted for UnCore frequency, of course; if enough readers want an absolutely exact comparison, we will do one later) - is already a good topic for research.

Testing

Traditionally, we divide all tests into a number of groups and show the average result for each group of tests/applications on the diagrams (for details of the testing methodology, see the separate article). The results in the diagrams are given in points; 100 points corresponds to the performance of the site's reference test system of the 2011 sample. It is based on an AMD Athlon II X4 620 processor, while the amount of memory (8 GB) and the video card are standard for all tests of the "main line" and can only be changed as part of special studies. Those interested in more detailed information are, as always, invited to download a table in Microsoft Excel format, in which all results are shown both as converted points and in "natural" form.

Interactive work in 3D packages

There is some effect from cache capacity, but it is under 1%. Accordingly, the two Ivy Bridges can be considered identical to each other, while the architectural improvements let the new Core i5 overtake the old Core i7 just as easily as the new Core i7 does.

Final rendering of 3D scenes

In this case, of course, no improvements can compensate for a larger number of processed threads, but what matters most for us today is not that, but the complete absence of any effect of cache capacity on performance. With Celeron and Pentium, as we established earlier, things are different: rendering programs are sensitive to L3 capacity, but only when there is too little of it. And 6 MiB for four cores, as we can see, is quite enough.

Packing and unpacking

Naturally, these tasks are sensitive to cache capacity, but the effect of increasing it from 6 to 8 MiB is quite modest here: about 3.6%. More interesting, in fact, is the comparison with the first generation: architectural improvements let the new i5 "smash" even the old i7 at equal frequencies, though that is in the overall standings, because two of the four tests are single-threaded and one is dual-threaded. Data compression in 7-Zip is naturally fastest on Nehalem HT: eight threads are always faster than four of comparable performance. But if we limit ourselves to four threads, our "Ivy Bridge 6M" loses not only to its progenitor but even to the old Nehalem: the microarchitectural improvements are completely cancelled out by the reduction in cache capacity.

Audio encoding

What was somewhat unexpected was not the size of the difference between the two Ivy Bridges, but the fact that it exists at all. True, it is so small that it can just as well be put down to rounding or measurement error.

Compilation

Threads matter, but so does cache capacity. However, as usual, not by much - about 1.5%. More interesting is the comparison with the first-generation Core with Hyper-Threading disabled: the new Core i5 wins even at an equal frequency, yet one of the three compilers (Microsoft's, to be precise) ran in the same time on both processors - actually with a 5-second advantage for the older one, even though in this same program the "full-cache" Ivy Bridge finishes 4 seconds ahead of Nehalem. In general, it cannot be said that the reduction in L3 capacity greatly affected the second- and third-generation Core i5, but there are nuances.

Mathematical and engineering calculations

Again less than a 1% difference from the "older" die, and again a convincing victory over the first generation in all its forms. For low-threaded tests like these this is more the rule than the exception, but why not confirm it once more? Especially in such a refined form, when (unlike the tests in normal mode) differences in frequency ("stock" or arising from Turbo Boost) do not interfere.

Raster graphics

But even with a more complete utilization of multithreading, the picture does not always change. And the capacity of the cache memory does not give anything at all.

Vector graphics

And here it is similar. True, only a couple of computation threads are needed.

Video encoding

Things are different in this group: here even Hyper-Threading does not allow Nehalem to fight on equal terms with its successors of newer generations. And those successors are hardly hindered by the reduced cache capacity - more precisely, it practically does not interfere at all, since the difference is again less than 1%.

Office software

As expected, there is no performance gain from increasing cache capacity (more precisely, no loss from reducing it). Although if you look at the detailed results, you can see that the only multi-threaded test in this group (text recognition in FineReader) is about 1.5% faster with 8 MiB of L3 than with 6 MiB. It would seem - what is 1.5%? In practical terms, nothing. But from a research point of view it is already interesting: as you can see, it is the multi-threaded tests that most often run short of cache. As a result, a difference (albeit a small one) sometimes shows up even where it should not. There is nothing inexplicable in this: roughly speaking, in low-threaded tests we have 3-6 MiB per thread, while in multi-threaded ones only about 1.5 MiB. The former is plenty; the latter may not quite be enough.

Java

However, the Java machine does not agree with this assessment, which is also understandable: as we have written more than once, it is optimized not so much for x86 processors as for phones and coffee makers, where there may be many cores but very little cache memory - and sometimes few cores as well, since both are expensive resources in terms of die area and power consumption. And while something can be done about cores and megahertz, with cache everything is harder: in the quad-core Tegra 3, for example, there is only 1 MiB of it. It is clear that the JVM could happily consume more (like all bytecode systems), as we already saw when comparing Celeron and Pentium, but if more than 1.5 MiB per thread can be of use at all, it is not in the tasks included in SPECjvm 2008.

Games

We had high hopes for the games, since they often turn out to be even more demanding of cache capacity than archivers. But that happens when there is very little of it, and 6 MiB, as we can see, is enough. Besides, quad-core Core-class processors of any generation, even at 2.4 GHz, are too powerful for the gaming applications used, so the bottleneck will obviously be other system components rather than the CPU. That is why we decided to dust off the low-quality graphics modes - for such systems this is clearly too synthetic, but then all our testing is synthetic :)

When video cards and the like do not get in the way, the difference between the two Ivy Bridges reaches a "whopping" 3%: in practice it can still be ignored, but for theory that is a lot. Only the archivers showed more.

Multitasking environment

We have seen this somewhere before. Right - when we tested six-core processors for LGA2011. And now the situation repeats itself: the load is multi-threaded, some of the programs used are "greedy" for cache, yet increasing the cache only reduces the average performance. How can this be explained? Perhaps by the fact that arbitration becomes more complicated and the number of misses grows. Note, moreover, that this happens only when L3 capacity is relatively large and there are at least four computation threads running at once - in the budget segment the picture is completely different. In any case, as our recent Pentium and Celeron testing showed, for dual-core processors increasing L3 from 2 to 3 MiB adds 6% to performance. But for quad- and six-core chips it gives, to put it mildly, nothing. Even less than nothing.

Total

A logical overall result: since no significant difference was found anywhere between processors with different L3 sizes, there is none in the overall score either. Thus, there is no reason to be upset about the reduced cache capacity of the second- and third-generation Core i5 - their first-generation predecessors are no competition for them anyway. Even the old Core i7, on average, only demonstrates a similar level of performance (mainly, of course, because it lags in low-threaded applications - there are scenarios it can handle faster under equal conditions). But, as we have already said, real processors are far from equal in frequency, so in practice the difference between generations is larger than can be obtained in studies like this one.

Only one question remains open: to ensure parity with the first-generation Core we had to greatly reduce the clock speed, but will the observed patterns hold in conditions closer to reality? The fact that four slow computing threads see no difference between 6 and 8 MiB of cache does not mean that none will be found with four fast ones. True, the opposite does not follow either, so to finally close the topic of theoretical studies we need one more piece of lab work, which we will do next time.

How important is L3 cache for AMD processors?

Indeed, it makes sense to equip a multi-core processor with dedicated memory shared by all available cores. In this role, a fast L3 cache can significantly speed up access to the data that is requested most often. Then the cores, whenever possible, do not have to go to slow main memory (RAM).

At least in theory. AMD recently announced the Athlon II X4 processor, which is a Phenom II X4 model without L3 cache, hinting that it is not necessary. We decided to directly compare two processors (with and without L3 cache) to see how the cache affects performance.

How does the cache work?

Before we dive into the tests, it's important to understand some basics. The principle of the cache is quite simple. The cache buffers data as close as possible to the processing cores of the processor in order to reduce CPU requests to more distant and slower memory. On modern desktop platforms, the cache hierarchy includes as many as three levels that precede access to RAM. Moreover, the caches of the second and, in particular, the third levels serve not only for data buffering. Their purpose is to prevent CPU bus overload when the cores need to exchange information.

Hits and misses

The effectiveness of a cache architecture is measured by its hit rate. Requests for data that can be satisfied from the cache count as hits. If the cache does not contain the required data, the request is passed further down the memory pipeline and a miss is counted. Misses, of course, mean more time spent getting the information, which produces "bubbles" (idle slots) and delays in the computational pipeline. Hits, on the other hand, allow maximum performance to be maintained.
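A back-of-the-envelope illustration (the latencies are made up purely for the arithmetic): if a hit costs 4 cycles, a miss that goes all the way to RAM costs 200 cycles, and the hit rate is 95%, the average access takes 0.95 × 4 + 0.05 × 200 = 13.8 cycles; raise the hit rate to 99% and it drops to 0.99 × 4 + 0.01 × 200 = 5.96 cycles. This is why even a few extra percentage points of hits matter so much.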

Cache writing, exclusivity, coherence

Replacement policies dictate how cache space is freed up for new entries. Since data written to the cache must eventually appear in main memory, the system can either write to main memory at the same time as to the cache (write-through), or mark the cached data as "dirty" (write-back) and write it out to memory when the line is evicted from the cache.

Data in several cache levels can be stored exclusively, that is, without redundancy: then you will not find the same data line in two different cache levels. Or caches can work inclusively, meaning the lower cache levels are guaranteed to contain the data present in the upper levels (those closer to the processor core). AMD Phenom uses an exclusive L3 cache, while Intel follows an inclusive cache strategy. Coherence protocols keep data consistent and up to date across cores, cache levels and even processors.

Cache size

A larger cache can hold more data, but tends to increase latency. In addition, a large cache consumes a considerable number of processor transistors, so it is important to find a balance between the "budget" of transistors, die size, power consumption and performance / latency.

Associativity

Entries from RAM can be direct-mapped into the cache, meaning there is only one position in the cache where a copy of a given piece of RAM data can live, or they can be n-way associative, meaning there are n possible locations in the cache where that data might be stored. A higher degree of associativity (up to fully associative caches) gives the best caching flexibility, because existing data in the cache does not have to be overwritten. In other words, higher associativity guarantees a higher hit rate, but it also increases latency, because more time is needed to check all those ways for a hit. As a rule, the highest degree of associativity is reasonable for the last level of cache, since the maximum capacity is available there and a miss beyond that cache sends the processor to slow RAM.

To give a few examples, the Core i5 and i7 use 32KB of L1 cache with 8-way associativity for data and 32KB of L1 cache with 4-way associativity for instructions. It is understandable that Intel wants instructions to be available faster, and the L1 cache for data has a maximum percentage of hits. Intel's L2 cache has 8-way associativity, while Intel's L3 cache is even smarter because it implements 16-way associativity to maximize hits.

However, AMD pursues a different strategy with the Phenom II X4, which uses a 2-way associative L1 cache to reduce latency. To compensate for the possible misses, the cache capacity was doubled: 64 KB for data and 64 KB for instructions. The L2 cache is 8-way associative, like Intel's design, but AMD's L3 cache works with 48-way associativity. Of course, the choice of one cache architecture or another cannot be judged without considering the whole CPU architecture: it is the test results that matter in practice, and our goal was precisely a practical test of this whole complex multi-level caching structure.

Almost all developers know that the processor cache is such a small but fast memory that stores data from recently visited areas of memory - the definition is short and quite accurate. Nevertheless, knowledge of the "boring" details about the mechanisms of the cache is necessary to understand the factors that affect the performance of the code.

In this article we will look at a number of examples illustrating various features of caches and their impact on performance. The examples are in C#; the choice of language and platform does not greatly affect the performance picture or the final conclusions. Naturally, within reasonable limits: if you choose a language in which reading a value from an array is equivalent to a hash table lookup, you will not get any results suitable for interpretation. Translator's notes are in italics.


Example 1: memory access and performance

How much faster do you think the second loop is than the first?
int[] arr = new int[64 * 1024 * 1024]; // the exact array size is an assumption; any sufficiently large array will do

// first
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// second
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;


The first loop multiplies all array values by 3; the second multiplies only every sixteenth value. The second loop does only 6% of the work of the first, yet on modern machines both loops take roughly the same time: 80 ms and 78 ms respectively (on my machine).

The answer is simple - memory access. The speed of these loops is determined primarily by the speed of the memory subsystem, not by the speed of integer multiplication. As we will see in the next example, the number of RAM accesses is the same in both cases.
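For reference, a minimal sketch of how the timing itself might be done (the original text does not show it; this reuses the arr array declared above and requires using System; using System.Diagnostics;):

var sw = Stopwatch.StartNew();
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;       // first loop
Console.WriteLine("step 1:  {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;   // second loop
Console.WriteLine("step 16: {0} ms", sw.ElapsedMilliseconds);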

Example 2: Impact of Cache Lines

Let's dig deeper - let's try other step values, not just 1 and 16:
for (int i = 0; i < arr.Length; i += K /* step */) arr[i] *= 3;

Here is the running time of this loop for various values of the step K:

Note that with step values from 1 to 16 the running time remains practically unchanged. But with values above 16, the running time roughly halves every time we double the step. This does not mean the loop magically starts running faster; it is just that the number of iterations also decreases. The key point is the equal running time for step values from 1 to 16.

The reason is that modern processors access memory not byte by byte but in small blocks called cache lines. Typically a cache line is 64 bytes long. When you read any value from memory, at least one whole cache line is pulled into the cache. Subsequent accesses to any value from that line are very fast.

Because 16 int values occupy 64 bytes, loops with steps from 1 to 16 touch the same number of cache lines - more precisely, all of the array's cache lines. With step 32 every second line is touched, with step 64 every fourth.

Understanding this is very important for some optimization methods. The number of cache-line accesses depends on how the data is laid out in memory. For example, misaligned data may require two accesses instead of one, and, as we found out above, the speed of work will then be half as high.
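To make that concrete (a hypothetical illustration): with 64-byte cache lines, a 4-byte int stored at offset 62 occupies bytes 62-65 and straddles the boundary between two lines, so reading it touches two cache lines where an aligned int would touch only one.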

Example 3: First and Second Level Caches (L1 and L2) Sizes

Modern processors typically have two or three levels of cache, commonly referred to as L1, L2 and L3. You can use the CoreInfo utility or the GetLogicalProcessorInformation Windows API function to find out the cache sizes at each level. Both also report the cache line size for each level.

On my machine, CoreInfo reports 32 KB L1 data caches, 32 KB L1 instruction caches, and a 4 MB unified L2 cache. Each core has its own L1 caches; each L2 cache is shared by a pair of cores:

Logical Processor to Cache Map:
*--- Data Cache          0, Level 1, 32 KB, Assoc 8, LineSize 64
*--- Instruction Cache   0, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Data Cache          1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Instruction Cache   1, Level 1, 32 KB, Assoc 8, LineSize 64
**-- Unified Cache       0, Level 2,  4 MB, Assoc 16, LineSize 64
--*- Data Cache          2, Level 1, 32 KB, Assoc 8, LineSize 64
--*- Instruction Cache   2, Level 1, 32 KB, Assoc 8, LineSize 64
---* Data Cache          3, Level 1, 32 KB, Assoc 8, LineSize 64
---* Instruction Cache   3, Level 1, 32 KB, Assoc 8, LineSize 64
--** Unified Cache       1, Level 2,  4 MB, Assoc 16, LineSize 64
Let's check this information experimentally. To do so, we will walk through an array, incrementing every 16th value - an easy way to modify the data in every cache line. When we reach the end, we go back to the beginning. We will test different array sizes; we should see drops in performance when the array no longer fits into a given cache level.

The code is like this:

// arr is an int[] whose length (a power of two) is varied between runs
int steps = 64 * 1024 * 1024; // number of iterations
int lengthMod = arr.Length - 1;

for (int i = 0; i < steps; i++)
{
    // x & lengthMod == x % arr.Length, because the length is a power of two
    arr[(i * 16) & lengthMod]++;
}


Test results:

On my machine, performance drops after 32 KB and 4 MB are noticeable - these are the sizes of the L1 and L2 caches.

Example 4: Instruction Parallelism

Now let's look at something else. Which of the two loops do you think will run faster?
int steps = 256 * 1024 * 1024;
int[] a = new int[2]; // a two-element array: the first loop increments a[0] twice, the second increments a[0] and a[1]

// first
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// second
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }


It turns out that the second loop runs almost twice as fast, at least on all the machines I have tested. Why? Because the statements inside the loops have different data dependencies. In the first loop, every a[0]++ depends on the result of the previous a[0]++, so the increments form one long dependency chain and must execute one after another.

In the second loop, the a[0]++ and a[1]++ increments are independent of each other, so they can be executed in parallel.

The functional units of modern processors can execute a certain number of such operations simultaneously - usually not a very large number. For example, two parallel accesses to data in the L1 cache are possible, as is the simultaneous execution of two simple arithmetic instructions. The first loop cannot take advantage of these capabilities; the second can.

Example 5: Cache Associativity

One of the key questions that needs to be answered when designing a cache is whether data from a certain memory area can be stored in any cache cells or only in some of them. Three possible solutions:
  1. Direct-mapped cache: the data of each line of RAM can be stored in only one, predetermined cache cell. The simplest way to compute the mapping is line_index_in_RAM % number_of_cache_cells. Two lines mapped to the same cell cannot be in the cache at the same time.
  2. N-way set-associative cache: each line can be stored in N different cache cells. For example, in a 16-way cache a line can be stored in any of the 16 cells that make up its group. Usually, lines whose indices share the same low-order bits end up in one group.
  3. Fully associative cache: any line can be stored in any cache cell. In its behavior this solution is equivalent to a hash table.
Direct-mapped caches are prone to conflicts: when two lines compete for one cell, constantly evicting each other, the efficiency is very low. Fully associative caches, while free of this drawback, are very complex and expensive to implement. Partially associative caches are a typical trade-off between implementation complexity and efficiency.

For example, on my machine the 4 MB L2 cache is a 16-way set-associative cache. The whole of RAM is divided into sets of lines according to the low-order bits of their indices; the lines of each set compete for one group of 16 L2 cache cells.

Since the L2 cache has 65,536 cells (4·2^20 / 64) and each group consists of 16 cells, there are 4,096 groups in total. Thus, the lower 12 bits of a line's index determine which group that line belongs to (2^12 = 4,096). As a result, lines whose addresses differ by a multiple of 262,144 bytes (4,096 × 64) share the same group of 16 cells and compete for a place in it.
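For readers who like to see the arithmetic in code, here is a small sketch reproducing the numbers above (the 4 MB size, 16 ways and 64-byte lines come from the CoreInfo output; the code itself is only an illustration):

const int CacheSize = 4 * 1024 * 1024; // 4 MB L2
const int LineSize  = 64;              // bytes per line
const int Ways      = 16;              // cells per group

const int TotalCells  = CacheSize / LineSize;  // 65 536 cells
const int NumGroups   = TotalCells / Ways;     // 4 096 groups
const int GroupStride = NumGroups * LineSize;  // 262 144 bytes between conflicting addresses

static int GroupOf(long address)
{
    return (int)((address / LineSize) % NumGroups); // the lower 12 bits of the line index
}

// GroupOf(0), GroupOf(262144) and GroupOf(524288) all return 0:
// these addresses compete for the same group of 16 cells.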

For the effects of associativity to work, we need to constantly access a large number of rows from the same group, for example, using the following code:

public static long UpdateEveryKthByte(byte[] arr, int K)
{
    const int rep = 1024 * 1024; // number of iterations

    Stopwatch sw = Stopwatch.StartNew(); // requires: using System.Diagnostics;

    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;

        p += K;
        if (p >= arr.Length) p = 0;
    }

    sw.Stop();
    return sw.ElapsedMilliseconds;
}


The method increments every K-th element of the array. When we reach the end, we start over. After a fairly large number of iterations (2^20) we stop. I made runs for various array sizes and values of the step K. Results (blue means a long running time, white a short one):

The blue areas correspond to cases where, with the data changing constantly, the cache cannot accommodate all the required data at once. Bright blue indicates a running time of about 80 ms, almost white about 10 ms.

Let's deal with the blue areas:

  1. Why do the vertical lines appear? They correspond to step values at which too many lines (more than 16) from one group are accessed at once. For these values, my machine's 16-way cache cannot hold all the data it needs.

    Some of the bad stride values are powers of two: 256 and 512. For example, consider stride 512 and an 8 MB array. With this stride there are 32 sections in the array (8·2^20 / 262,144) competing with each other for the cells of 512 cache groups (262,144 / 512). There are 32 sections, but only 16 cells per group in the cache, so there is not enough room for all of them.

    Other stride values that are not powers of two are simply unlucky, causing a lot of accesses to the same cache groups and likewise producing vertical blue lines in the figure. At this point, lovers of number theory are invited to ponder why.

  2. Why do the vertical lines break off at the 4 MB boundary? With an array of 4 MB or smaller, the 16-way cache behaves like a fully associative cache, that is, it can hold all the data of the array without conflicts: no more than 16 regions fight for any one cache group (262,144 × 16 = 4·2^20 = 4 MB).
  3. Why is there a big blue triangle on the top left? Because with a small step and a large array, the cache is not able to fit all the necessary data. The degree of associativity of the cache plays a secondary role here, the limitation is related to the size of the L2 cache.

    For example, with a 16 MB array and a stride of 128, we touch every 128th byte, thus updating every other cache line of the array. Storing every other line would take 8 MB of cache, but my machine has only 4 MB.

    Even if the cache were fully associative, this would not allow 8 MB of data to be stored in it. Note that in the above example with stride 512 and array size 8 MB, we need only 1 MB of cache to store all the data we need, but this is not possible due to insufficient cache associativity.

  4. Why does the left side of the triangle gain intensity gradually? The maximum intensity falls on a step of 64 bytes, equal to the cache line size. As we saw in the first and second examples, sequential accesses to the same line cost next to nothing; with a 16-byte step, say, we get four memory accesses for the price of one.

    Since the number of iterations is the same in our test for any step value, the cheaper step results in less running time.

The detected effects are also preserved at large values of the parameters:

Cache associativity is an interesting thing that can show up under certain conditions. Unlike the other problems discussed in this article, it is not so serious. Certainly, this is not something that requires constant attention when writing programs.

Example 6: False Cache Sharing

On multi-core machines you may run into another problem - keeping the caches coherent. Processor cores have partially or completely separate caches. On my machine the L1 caches are separate (as usual), and there are two L2 caches, each shared by a pair of cores. Details vary, but in general a modern multi-core processor has a multi-level hierarchy of caches, with the fastest but smallest caches belonging to individual cores.

When one core modifies a value in its cache, the other cores can no longer use the old value, and it must be updated in their caches. Moreover, the entire cache line has to be updated, because caches operate on whole lines rather than individual values.

Let's demonstrate this problem with the following code:

private static int[] s_counter = new int[1024]; // the exact length is an assumption; it only needs to cover the positions used below

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}


If on my four-core machine I call this method with parameters 0, 1, 2, 3 simultaneously from four threads, then the running time will be 4.3 seconds. But if I call the method with parameters 16, 32, 48, 64, then the running time will be only 0.28 seconds.
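The text does not show how the four calls are launched; a minimal sketch (assuming it is invoked from an instance method of the same class as the listing above, and that System.Threading.Tasks is referenced) might be:

// Run the four updates in parallel. With positions 0, 1, 2, 3 the counters
// share one cache line (slow); with 16, 32, 48, 64 each gets its own line (fast).
Parallel.Invoke(
    () => UpdateCounter(0),
    () => UpdateCounter(1),
    () => UpdateCounter(2),
    () => UpdateCounter(3));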

Why? In the first case, all four values processed by the threads at any given time are very likely to end up in the same cache line. Each time one core increments a value, it marks the cache locations containing that value in other cores as invalid. After this operation, all other cores will have to cache the line again. This renders the caching mechanism inoperable, killing performance.

Example 7: hardware complexity

Even now, when the principles of cache operation are not a secret for you, hardware will still surprise you. Processors differ from each other in optimization methods, heuristics, and other subtleties of implementation.

The L1 cache of some processors can access two cells in parallel if they belong to different groups, but if they belong to the same group, only sequentially. As far as I know, some can even access different quarters of the same cell in parallel.

Processors can surprise you with clever optimizations. For example, the code from the previous example about false cache sharing does not work as intended on my home computer - in the simplest cases, the processor can optimize performance and reduce negative effects. If the code is slightly modified, everything falls into place.

Here is another example of strange hardware quirks:

private static int A, B, C, D, E, F, G;

private static void Weirdness()
{
    for (int i = 0; i < 200000000; i++)
    {
        <some code>
    }
}


If we substitute three different variants in place of <some code>, we get the following results:

Incrementing the fields A, B, C, D takes longer than incrementing A, C, E, G. Stranger still, incrementing A and C alone takes more time than incrementing A, C and E, G. I do not know exactly what causes this, but perhaps it is related to memory banks (yes, yes, ordinary memory banks, not the three-liter savings jars you might have thought of). If you have any thoughts on this, please share them in the comments.

On my machine, the above is not observed, however, sometimes there are abnormally bad results - most likely, the task scheduler makes its own "corrections".

The lesson of this example is that it is very difficult to fully predict the behavior of hardware. Yes, you can predict a lot, but you need to constantly confirm your predictions with measurements and tests.

Conclusion

I hope that everything discussed has helped you understand the structure of processor caches. Now you can put what you learned into practice to optimize your code.