How to limit the number of php-cgi processes for mod_fcgid

Php-cgi processes consume memory, multiply rapidly and refuse to die even after the FcgidMaxRequestsPerProcess limit is reached; php-cgi then starts dumping everything into swap and the system begins returning "502 Bad Gateway".

To limit the number of forked php-cgi processes, it is not enough to set FcgidMaxRequestsPerProcess: processes are supposed to be killed after reaching it, but they do not always exit voluntarily.

The situation is painfully familiar: php-cgi child processes eating away at memory keep multiplying, yet you cannot make them die - they want to live! :) Reminds one of the problem of overpopulation of the Earth, doesn't it? ;)

The eternal imbalance between parents and children can be settled by limiting the number of php-cgi children, their lifetime (genocide) and the activity of their reproduction (contraception).

Limiting the number of php-cgi processes for mod_fcgid

The directives below probably play the most important role in limiting the number of php-cgi processes, and in most cases the default values shown here are harmful for servers with less than 5-10 GB of RAM:

  • FcgidMaxProcesses 1000 - the maximum number of processes that can be active at the same time;
  • FcgidMaxProcessesPerClass 100 - the maximum number of processes in one class (segment), i.e. the maximum number of processes allowed to be spawned through the same wrapper;
  • FcgidMinProcessesPerClass 3 - the minimum number of processes in one class (segment), i.e. the minimum number of processes launched through the same wrapper that remain available after all requests have been completed;
  • FcgidMaxRequestsPerProcess 0 - a FastCGI process should kick the bucket after completing this number of requests (0 means no limit).

What number of php-cgi processes is optimal? To determine it, you need to take into account the total amount of RAM and the amount of memory allowed for PHP in memory_limit (php.ini), which each php-cgi process can consume while executing a PHP script. For example, if we have 512 MB, of which 150-200 MB go to the OS itself and another 50-100 MB to the database server, mail MTA and so on, and memory_limit = 64, then within the remaining 200-250 MB we can run 3-4 php-cgi processes simultaneously without much damage.

Php-cgi process lifetime and timeout settings

When php-cgi children reproduce actively and eat RAM, they can live almost forever, and that is fraught with cataclysms. Below is a list of directives that help cut the lifetime of php-cgi processes and release the resources they occupy in time:

  • FcgidIOTimeout 40 - the time (in seconds) the mod_fcgid module will wait for I/O while trying to execute the script;
  • FcgidProcessLifeTime 3600 - if a process exists longer than this time (in seconds), it will be marked for termination during the next process scan, whose interval is set by the FcgidIdleScanInterval directive;
  • FcgidIdleTimeout 300 - if the number of processes exceeds FcgidMinProcessesPerClass, a process that has not handled requests for this time (in seconds) will be marked for killing during the next process scan (interval set by FcgidIdleScanInterval);
  • FcgidIdleScanInterval 120 - the interval at which the mod_fcgid module searches for processes exceeding the FcgidIdleTimeout or FcgidProcessLifeTime limits;
  • FcgidBusyTimeout 300 - if a process has been busy handling a request for longer than this time (in seconds), it will be marked for killing during the next scan, whose interval is set by FcgidBusyScanInterval;
  • FcgidBusyScanInterval 120 - the interval at which busy processes that have exceeded the FcgidBusyTimeout limit are searched for;
  • FcgidErrorScanInterval 3 - the interval (in seconds) at which the mod_fcgid module kills processes awaiting termination, including those that exceeded FcgidIdleTimeout or FcgidProcessLifeTime. Killing is done by sending SIGTERM to the process, and if the process is still alive it is finished off with SIGKILL.

Keep in mind that a process exceeding FcgidIdleTimeout or FcgidBusyTimeout can live up to an extra FcgidIdleScanInterval or FcgidBusyScanInterval, since only during the next scan will it be marked for termination.

It is better to set the ScanInterval values a few seconds apart, for example FcgidIdleScanInterval 120 and FcgidBusyScanInterval 117, so that the two process scans do not run at the same time.

Php-cgi spawning activity

If none of the above helped, which would be surprising, you can still try tinkering with the spawning activity of php-cgi processes.

In addition to the limits on the number of requests, the number of php-cgi processes and their lifetime, there is also the spawning activity of child processes, which is regulated by the FcgidSpawnScore, FcgidTerminationScore, FcgidTimeScore and FcgidSpawnScoreUpLimit directives (I hope my translation of their descriptions is correct; default values are the ones given). The key one is FcgidSpawnScoreUpLimit: if the current spawn score exceeds it, no new child processes will be spawned for the application, and all spawn requests will wait until an existing process becomes free or until the score drops back below this limit.

If my translation and understanding of the above parameters are correct, then to reduce the spawning activity of php-cgi processes you should lower the FcgidSpawnScoreUpLimit value or increase the FcgidSpawnScore and FcgidTerminationScore values.

Outcome

I hope I have listed and chewed through in detail most of the mod_fcgid directives that help limit the number of php-cgi processes and their lifetime, as well as lower resource consumption. Below is a complete mod_fcgid config for a successfully working server with a 2500 MHz processor and 512 MB of RAM:
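A minimal sketch, assuming the 512 MB / memory_limit = 64 arithmetic above (3-4 simultaneous processes) and the directives discussed in this article; the exact numbers and the wrapper setup are assumptions to adapt to your own server:

<IfModule mod_fcgid.c>
    # process count limits (see the 512 MB calculation above)
    FcgidMaxProcesses 4
    FcgidMaxProcessesPerClass 4
    FcgidMinProcessesPerClass 0
    FcgidMaxRequestsPerProcess 500

    # lifetime and timeouts
    FcgidIOTimeout 40
    FcgidProcessLifeTime 3600
    FcgidIdleTimeout 300
    FcgidIdleScanInterval 120
    FcgidBusyTimeout 300
    FcgidBusyScanInterval 117
    FcgidErrorScanInterval 3

    # spawning activity
    FcgidSpawnScore 3
    FcgidTerminationScore 9
    FcgidSpawnScoreUpLimit 10
</IfModule>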

Oleg Golovsky

Why does a browser need multiple processes? The multi-process architecture improves security and stability: if something fails, it will not drag everything else along with it.

In fact, the multi-process trick has been used by other browsers for a long time, and much more aggressively than Firefox. For example, Chrome and all Chromium-based browsers (modern Opera, Yandex Browser and others) can show dozens of processes in memory in the task manager if you have a lot of tabs loaded.

There is one serious downside to this: many processes can heavily load a weak computer, and if you are used to working with a large number of tabs or have many extensions installed, even a PC with fairly current specifications can start to strain.

Firefox spawns fewer processes than Chrome?

As we said, Mozilla has approached the issue with multiple processes much more cautiously than Google.

Initially, the developers added only one extra process to Firefox, in which plugins (not to be confused with extensions) were run - plugin-container.exe. That is how Firefox first got 2 processes.

However, time demanded that the company not fall behind its competitors in terms of stability and security. As a result, the long-tested full multiprocess architecture of Firefox was completed this year.

Firefox does not lose its advantage of lower memory consumption even when it uses multiprocessing to the maximum (e.g. 8 content processes on an 8-core CPU)

Some users of stable Firefox versions were able to try multiprocessing for the first time this summer, starting with Firefox 54. The final stage was the fall release of Firefox 57, which no longer supports legacy extensions; some of those extensions could previously block multiprocessing, forcing Firefox to use only one process.

However, processes in Firefox still do not work quite the way they do in Chrome. Whereas Google's brainchild launches literally everything (every tab, every extension) in a separate process, Firefox groups the various elements together. As a result, there are far fewer processes than in its main competitor.

Hence the noticeably lower memory consumption and, in some cases, lower CPU load - after all, the huge number of processes in Chromium browsers can load even a far-from-weak processor. In the end Mozilla arrived at a compromise and, in our opinion, the most reasonable solution.

In addition, Firefox uses a different tab-on-demand mechanism than Chrome and Chromium-based browsers.

While those browsers automatically load the tabs of the previous session in the background one after another, the "fire fox" does so only when a tab is explicitly accessed (clicked), so it does not create processes when they are not needed. This also contributes to lower resource consumption.

How can I reduce the number of Firefox processes?

Unlike Google, Mozilla gives the user practical control over how many processes the browser keeps in memory.

If you see several firefox.exe processes (or firefox.exe *32 if you use the 32-bit version) hanging in the task manager and want to remove or disable them - no problem. Open the settings and scroll down the "General" section until you reach the "Performance" subsection:

If you uncheck the Use Recommended Performance Settings option, you will be presented with a setting for the number of content processing processes.

There are options from 1 to 7 processes to choose from (if you have more than 8 GB of memory, then more than 7 processes may be offered):

At this point, it is worth making a few important clarifications.

First, we are talking about content processes. If you specify, say, only 1 process here, the total number of processes in memory will decrease, but you still will not end up with a single copy of firefox.exe, because in addition to content, Firefox also moves interface handling into separate processes.

Secondly, reducing the number of processes only makes sense on computers with a small amount of RAM and extremely weak hardware. On more or less decent hardware, multiprocessing will not degrade performance but, on the contrary, will improve it, albeit at the cost of increased memory consumption.

Is there any benefit from reducing the number of processes?

To take our own example: on a PC with 8 GB of RAM, 4 content processes were offered by default, while up to 7 processes could appear in memory when a large number of tabs were open.

When we set the number of content processes to 1, restarted the browser and re-clicked all the tabs to load them, predictably only 4 processes remained in memory.

Of these, 3 belong to the browser itself and 1 is the content process; the latter is easy to spot, because with a decent number of tabs open it starts consuming much more memory than the others:

In Firefox, we had 15 different sites open. In the initial mode (7 processes), the total memory consumption was about 1.5 GB. When there were only four processes left, then in total they took about 1.4 GB (see screenshots above).

We repeated the experiment several times, and each time the RAM "gain" was only 100-150 MB. Bear in mind that browser performance could suffer from switching to a single content process. So, as you can see, there is very little point in reducing the number of processes.


In this article I will describe the fundamental differences between Apache and Nginx, the frontend-backend architecture, and the installation of Apache as a backend and Nginx as a frontend. I will also describe a technique that speeds up the web server: gzip_static + yuicompressor.

Nginx is a lightweight server; it starts a specified number of worker processes (usually the number of processes = the number of cores), and each process accepts and handles connections in a loop. This model allows a large number of clients to be served at low resource cost. However, with such a model you cannot perform long-running operations while handling a request (for example mod_php), since that would essentially hang the server. In each cycle a process essentially performs two operations: read a block of data from somewhere, write it somewhere. "Somewhere" is a client connection, a connection to another web server or FastCGI process, the file system, or a buffer in memory. The model is tuned with two main parameters:
worker_processes - the number of worker processes to start. Usually set equal to the number of processor cores.
worker_connections - the maximum number of connections handled by one process. Directly limited by the maximum number of open file descriptors on the system (1024 by default on Linux).
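For example, the corresponding fragment of nginx.conf on a quad-core machine might look like this (the numbers are purely illustrative):

worker_processes 4;

events {
    worker_connections 1024;
}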

Apache is a heavy server (it should be noted that it can be slimmed down considerably if desired, but that will not change its architecture); it has two main models of operation - prefork and worker.
With the prefork model, Apache creates a new process to handle each request, and that process does all the work: it accepts the request, generates the content and returns it to the user. The model is configured with the following parameters:


MinSpareServers - the minimum number of idle processes. Needed so that an incoming request can start being processed faster; the web server will spawn additional processes to maintain this number.
MaxSpareServers - the maximum number of idle processes. Needed so as not to occupy extra memory; the web server will kill surplus processes.
MaxClients - the maximum number of concurrently served clients. The web server will not start more than this number of processes.
MaxRequestsPerChild - the maximum number of requests a process will handle before the web server kills it. Again, this saves memory: otherwise memory in long-lived processes would gradually "leak away".
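A sketch of a prefork fragment for httpd.conf; the values below are purely illustrative and should be tuned to your memory budget:

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          150
    MaxRequestsPerChild 1000
</IfModule>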

This model was the only one supported in Apache 1.3. It is stable and does not require multithreading support from the system, but it consumes more resources and is slightly slower than the worker model.
With the worker model, Apache creates several processes, each with several threads, and each request is handled entirely in a separate thread. It is slightly less stable than prefork, because a crashed thread can bring down the whole process, but it runs a little faster and uses fewer resources. The model is configured with the following parameters:

StartServers - sets the number of processes to start when the web server starts.
MinSpareThreads - The minimum number of threads hanging idle in each process.
MaxSpareThreads - The maximum number of threads hanging idle in each process.
ThreadsPerChild - sets the number of threads that each process starts when the process starts.
MaxClients - the maximum number of concurrently served clients. In this case, specifies the total number of threads in all processes.
MaxRequestsPerChild - the maximum number of requests that the process will process, after which the web server will kill it.
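A similar illustrative fragment for the worker MPM (again, the numbers are an assumption, not a recommendation):

<IfModule mpm_worker_module>
    StartServers          2
    MinSpareThreads      25
    MaxSpareThreads      75
    ThreadsPerChild      25
    MaxClients          150
    MaxRequestsPerChild   0
</IfModule>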

Frontend-backend architecture

The main problem with Apache is that a separate process (or at least a thread) is allocated for each request; it is also loaded with various modules and consumes a lot of resources. Moreover, this process stays in memory until it has delivered all the content to the client. If the client is on a slow channel and the content is large, this can take a long time: for example, the server generates the content in 0.1 seconds and then spends 10 seconds delivering it to the client, occupying system resources the whole time.
A frontend-backend architecture is used to solve this problem. Its essence is that the client's request arrives at a light server with an architecture like Nginx (the frontend), which redirects (proxies) the request to a heavy server (the backend). The backend generates the content, hands it to the frontend very quickly and frees up system resources. The frontend puts the backend's result in its buffer and can deliver it to the client slowly and persistently, consuming far fewer resources than the backend would. Additionally, the frontend can serve static files (css, js, images, etc.) on its own, control access, check authorization and so on.

Configuring Nginx (frontend) + Apache (backend) bundle

It is assumed that you already have Nginx and Apache installed. You need to configure the servers to listen on different ports; if both are installed on the same machine, it is better to bind the backend only to the loopback interface (127.0.0.1). In Nginx this is configured with the listen directive:
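listen 80;

Port 80 for the frontend is an assumption here; the only requirement is that it differs from the port the backend listens on.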

In Apache, this is configured with the Listen directive:

Listen 127.0.0.1:81

Next, you need to tell Nginx to proxy requests to the backend. This is done with the directive proxy_pass http://127.0.0.1:81;. That is the entire minimal configuration. However, as said above, it is better to let Nginx serve static files itself. Say we have a typical PHP site: then we only need to proxy requests for .php files to Apache and handle everything else in Nginx (if your site uses mod_rewrite, you can do the rewrites in Nginx too and simply throw the .htaccess files away). We also have to take into account that the client's request arrives at Nginx, while the request to Apache is made by Nginx itself, so there will be no Host HTTP header and Apache will see the client address (REMOTE_ADDR) as 127.0.0.1. Substituting the Host header is easy, but Apache determines REMOTE_ADDR on its own. This problem is solved by mod_rpaf for Apache. It works as follows: Nginx knows the client's IP and puts it into a certain HTTP header (for example, X-Real-IP); mod_rpaf reads that header and writes its contents into Apache's REMOTE_ADDR variable. Thus, the PHP scripts executed by Apache will see the real IP of the client.
Now the configuration will get complicated. First, make sure that the same virtual host exists in both Nginx and Apache, with the same root. Example for Nginx:

server {
    listen 80;
    server_name site;
    root /var/www/site/;
}

Example for Apache:

<VirtualHost 127.0.0.1:81>
    ServerName site
    DocumentRoot /var/www/site/
</VirtualHost>

Now we set the settings for the above scheme:
Nginx:
server {
    listen 80;
    server_name site;

    location / {
        root /var/www/site/;
        index index.php;
    }

    location ~ \.php($|/) {
        proxy_pass http://127.0.0.1:81;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
    }
}


Apache:

# mod_rpaf settings
RPAFenable On
RPAFproxy_ips 127.0.0.1
RPAFheader X-Real-IP

<VirtualHost 127.0.0.1:81>
    DocumentRoot "/var/www/site/"
    ServerName site
</VirtualHost>

The regular expression \.php($|/) covers two cases: a request for *.php and a request for *.php/foo/bar. The second form is required by many CMSs. A request for the site root will also work (since we have defined the index file index.php) and will likewise be proxied to Apache.

Speeding up: gzip_static + yuicompressor

Gzip on the web is a good thing: text files compress very well, traffic is saved and content reaches the user faster. Nginx can compress on the fly, so that is not a problem, but compressing a file takes some time, including CPU time. This is where Nginx's gzip_static directive comes to the rescue. Its essence: if, when a file is requested, Nginx finds a file with the same name and an additional ".gz" extension (for example, style.css and style.css.gz), then instead of compressing style.css it reads the already compressed style.css.gz from disk and serves it as the compressed version of style.css.
Nginx settings will look like this:

http {
    ...
    gzip_static on;
    gzip on;
    gzip_comp_level 9;
    gzip_types application/x-javascript text/css;
    ...
}
Great: we generate the .gz file once and Nginx serves it many times. In addition, we will compress css and js with YUI Compressor. This utility minifies css and js files as much as possible by removing whitespace, shortening names and so on.
All of this can be compressed, and even kept up to date, automatically using cron and a small script. Add the following command to cron to run once a day:

/usr/bin/find /var/www -mmin -1500 \( -iname "*.js" -or -iname "*.css" \) | xargs -i -n 1 -P 2 packfile.sh "{}"

In the -P 2 parameter specify the number of cores of your processor, do not forget to write the full path to packfile.sh, and change /var/www to your web directory.
In the packfile.sh file, add:
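A minimal sketch of such a script, assuming YUI Compressor is installed as /usr/local/bin/yuicompressor.jar (adjust the path to your installation) and java is available:

#!/bin/sh
# packfile.sh - minify a css/js file with YUI Compressor and pre-compress it for gzip_static
src="$1"
tmp="${src}.tmp"

# minify in place
java -jar /usr/local/bin/yuicompressor.jar "$src" -o "$tmp" && mv "$tmp" "$src"

# create the .gz copy next to the original with maximum compression
gzip -9 -c "$src" > "${src}.gz"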

This is the fourth article in the "Pushing the Limits of Windows" series, in which I explore the limits of fundamental resources in Windows. This time I am going to discuss the limit on the maximum number of threads and processes supported by Windows. I will briefly describe the difference between a thread and a process, cover the thread limits, and then talk about the limits related to processes. I decided to cover thread limits first because every active process has at least one thread (a process that has exited but to which a handle is still held by another process has no threads), so process limits derive directly from the underlying thread limits.

Unlike some flavors of UNIX, most Windows resources do not have a fixed limit built into the operating system at build time; rather, they are constrained by the basic resources available to the OS that I discussed earlier. Processes and threads, for example, require physical memory, virtual memory and pool memory, so the number of processes and threads that can be created on a given Windows system is ultimately determined by one of these resources, depending on how the processes or threads were created and which basic resource limit is reached first. I therefore recommend reading my previous articles if you have not yet done so, because I will be referring to concepts such as reserved memory, allocated (committed) memory and the system memory limit that I covered there.

Processes and threads
The Windows process is essentially a container that stores command code from an executable file. It is a kernel process object, and Windows uses this process object and its associated data structures to store and maintain information about the application's executable code. For example, a process has a virtual address space that stores its private and public data and into which the executable image and associated DLLs are mapped. Windows uses diagnostic tools to record information about a process's resource usage for accounting and query execution, and registers process references to operating system objects in the process descriptor table. Processes operate with a security context called a token, which identifies the user account, account groups, and privileges assigned to the process.

A process includes one or more threads, which actually execute the code in the process (technically, it is not processes that run, but threads) and are represented by kernel thread objects. Applications create threads in addition to their initial thread for several reasons: 1) processes with a user interface usually create threads so they can do their work while keeping the main thread responsive to user input and window management; 2) applications that want to use multiple processors to scale performance, or that want to keep running while some threads are blocked waiting for synchronous I/O, create additional threads to benefit from multithreading.

Thread limits
In addition to basic information about a thread, including the state of its CPU registers, its priority and its resource usage data, every thread gets an allocated portion of the process address space called the stack, which the thread uses as working memory while executing code: to pass function parameters and store local variables and function return addresses. To avoid wasting the system's virtual memory, only part of the stack is initially committed to the thread, and the remainder is merely reserved. Since stacks grow downward in memory, the system places so-called guard pages beyond the committed part of the stack, which trigger automatic commitment of additional memory (called stack growth) when it is needed. The following illustration shows how the committed area of the stack deepens and the guard pages move as the stack grows in a 32-bit address space:

The Portable Executable (PE) structures of the executable image define the amount of address space reserved and initially committed for a thread's stack. By default the linker reserves 1MB and commits one page (4KB), but developers can change these values either by changing the PE values when linking their program, or for an individual thread by passing values to the CreateThread function. You can use a utility such as Dumpbin, which comes with Visual Studio, to view the settings of an executable. Here are the results of running Dumpbin with the /headers option on the executable generated by a new Visual Studio project:

Converting the numbers from hexadecimal, you can see that the stack reserve is 1MB and the committed memory area is 4KB. Using a new Sysinternals utility called VMMap, you can attach to this process and view its address space, and thereby see the initially committed page of the process's stack, the guard page, and the rest of the reserved stack memory:

Since each thread consumes a portion of a process's address space, processes have a base limit on the number of threads they can create, equal to the size of their address space divided by the size of the thread's stack.

Limits of 32-bit threads
Even if a process had no code or data at all and the entire address space could be used for stacks, a 32-bit process with the default 2GB address space could create at most 2048 threads. Here are the results of running Testlimit on 32-bit Windows with the -t switch (create threads), confirming this limit:

Once again, since part of the address space was already used for code and initial heap, not all 2GB were available for thread stacks, so the total number of threads created could not reach the theoretical limit of 2048 threads.

I tried running Testlimit with an additional option that gives the application an extended address space, hoping that if it got more than 2GB of address space (on 32-bit systems this is achieved by booting with the /3GB or /USERVA option in Boot.ini, or the equivalent increaseuserva BCD option on Vista and later), it would use it. 32-bit processes get 4GB of address space when run on 64-bit Windows; so how many threads can the 32-bit Testlimit create when run on 64-bit Windows? Based on what we have discussed so far, the answer should be 4096 (4GB divided by 1MB), but in practice the number is significantly lower. Here is 32-bit Testlimit running on 64-bit Windows XP:

The reason for the discrepancy is that when you run a 32-bit application on 64-bit Windows, it is actually a 64-bit process that executes 64-bit code on behalf of the 32-bit threads, so memory is reserved per thread for both a 64-bit and a 32-bit stack. The 64-bit stack gets a reserve of 256KB (the exceptions are OS versions released before Vista, where the initial stack size of 64-bit threads is 1MB). Because every 32-bit thread begins life in 64-bit mode and the stack space it uses at startup exceeds a page, in most cases you will see at least 16KB committed for the 64-bit thread stack. Here is an example of the 64-bit and 32-bit stacks of a 32-bit thread (the 32-bit stack is labeled "Wow64"):

32-bit Testlimit was able to create 3204 threads in 64-bit Windows, which is explained by the fact that each thread uses 1MB + 256KB of address space for the stack (again, the exception is Windows versions before Vista, which use 1MB + 1MB). However, I got a different result by running 32-bit Testlimit on 64-bit Windows 7:

The difference between the results on Windows XP and Windows 7 is caused by the more random nature of the address space layout in Windows Vista and later, Address Space Layout Randomization (ASLR), which leads to some fragmentation. Randomizing DLL load addresses, thread stacks and heap placement helps protect against malware. As you can see in the following VMMap snapshot, there are still 357MB of available address space on the test system, but the largest free block is 128KB, less than the 1MB required for a 32-bit stack:

As I noted, a developer can override the default stack reserve size. One possible reason is to avoid wasting address space when it is known in advance that a thread's stack will always use less than the default 1MB. The Testlimit PE image uses a 64KB stack reserve by default, and when the -n switch is specified together with -t, Testlimit creates threads with 64KB stacks. Here is the result of running the utility on a system with 32-bit Windows XP and 256MB of RAM (I deliberately ran this test on a weak system to highlight this limit):

Note that a different error occurred this time, which means that address space is not the limiting factor in this situation. In fact, 64KB stacks should allow roughly 32,000 threads (2GB / 64KB = 32768). So which limit showed up in this case? Looking at the likely candidates, including committed memory and pool, none of them give a clue, since all these values are below their limits:

We can find the answer in the additional memory information from the kernel debugger, which points us to the relevant limit: resident available memory, all of which has been exhausted:

Resident available memory is the physical memory available for data or code that must stay in RAM. The nonpaged pool and nonpaged drivers are accounted for separately, as is, for example, memory locked in RAM for I/O operations. Every thread has a user-mode stack, as I mentioned earlier, but it also has a privileged-mode (kernel-mode) stack, used when the thread runs in kernel mode, for example while executing system calls. When a thread is active, its kernel stack is locked into memory, so that the thread can execute code in the kernel for which the required pages cannot be paged out.

The base kernel stack is 12KB on 32-bit Windows and 24KB on 64-bit Windows. 14225 threads require approximately 170MB of resident memory for themselves, which exactly corresponds to the amount of free memory on this system with Testlimit disabled:

Once the limit of available resident memory is reached, many basic operations start to fail. For example, here is the error I got when double-clicking the Internet Explorer shortcut on the desktop:

As expected, running on 64-bit Windows with 256MB of RAM, Testlimit was able to create 6,600 threads - about half of how many threads this utility could create on 32-bit Windows with 256MB of RAM - before it ran out of available memory:

The reason I used the term "base" kernel stack earlier is that a thread that works with graphics and windowing gets a "large" stack when it makes its first call to such an API: 20KB or larger on 32-bit Windows and 48KB on 64-bit Windows. Testlimit threads do not call any such APIs, so they have base kernel stacks.
Limits of 64-bit threads

Like 32-bit threads, 64-bit threads have a 1MB stack reserve by default, but 64-bit threads have much more user address space (8TB), so address space should not be a problem when creating a large number of threads. Yet resident available memory obviously remains a potential limiter. The 64-bit version of Testlimit (Testlimit64.exe) was able to create about 6,600 threads, with or without the -n switch, on a system with 64-bit Windows XP and 256MB of RAM, exactly as the 32-bit version did, because the resident available memory limit was reached. However, on a system with 2GB of RAM, Testlimit64 was able to create only 55,000 threads, significantly fewer than the utility could create if resident available memory were the limit (2GB / 24KB = 89,000):

In this case the cause is the committed initial thread stack, which makes the system run out of virtual memory and hit an error related to insufficient paging file size. Once the amount of committed memory reaches the size of physical memory, the rate at which new threads are created drops significantly, because the system begins to thrash: previously created thread stacks get paged out to the paging file to make room for the stacks of new threads, and the paging file has to grow. With the -n switch the results are the same, since the initial committed stack size stays the same.

Process limitations
Obviously, Windows can support fewer processes than threads, because every process contains at least one thread and a process itself consumes additional resources. 32-bit Testlimit running on a system with 64-bit Windows XP and 2GB of system memory creates about 8,400 processes:

If you look at the result of the kernel debugger, it becomes clear that in this case the limitation of the resident available memory is reached:

If the only resident available memory a process used were the stack of its privileged-mode thread, Testlimit could create far more than 8,400 processes on a 2GB system. The amount of resident available memory on this system without Testlimit running is 1.9GB:

By dividing the amount of resident memory used by Testlimit (1.9GB) by the number of processes it creates, we find that 230KB of resident memory is allocated for each process. Since the 64-bit kernel stack is 24KB, we get about 206KB missing for each process. Where is the rest of the resident memory used? When the process is created, Windows reserves enough physical memory to provide the minimum working set. This is done to ensure that the process has enough physical memory in any situation to store the amount of data needed to provide the minimum working set of pages. The default working set size is often 200Kb, which can be easily verified by adding the Minimum Working Set column in the Process Explorer window:

The remaining 6Kb is the resident available memory allocated for additional nonpageable memory, in which the process itself is stored. A 32-bit Windows process uses slightly less resident memory because its thread privileged stack is smaller.

As with user-mode thread stacks, processes can override their default working set size by using the SetProcessWorkingSetSize function. Testlimit supports a -n parameter which, in combination with -p, makes the child processes of the main Testlimit process run with the smallest possible working set size, 80KB. Because the child processes need time to shrink their working sets, Testlimit, once it can no longer create processes, pauses and tries again, giving its children a chance to run. Testlimit launched with -n on a system with Windows 7 and 4GB of RAM hit a different limit than resident available memory - the limit of allocated system memory:

In the snapshot below you can see that the kernel debugger reports not only that the allocated system memory limit has been reached, but also that after reaching it there were thousands of allocation errors, both for virtual memory and for memory allocated from the paged pool (the allocated system memory limit was actually reached several times, since whenever an error related to insufficient paging file size occurred, the paging file was grown, pushing the limit up):

Before Testlimit was run, the average allocation level was approximately 1.5GB, so the threads took up about 8GB of allocated memory. Each process therefore consumed approximately 8GB / 6600, or 1.2MB. The output of the kernel debugger's !vm command, which shows each process's private memory allocation, confirms this calculation:

The initial memory committed for the thread stack, described earlier, is only a small part of this; the rest goes to allocations for the process address space data structures, page table entries, handle tables, process and thread objects, and the private data the process creates during its initialization.

How many processes and threads are sufficient?
Thus the answers to the questions "how many threads does Windows support?" and "how many processes can you run simultaneously on Windows?" are interrelated. Aside from the nuances of how threads define their stack sizes and how processes define their minimum working sets, the two main factors that determine the answers on any given system are the amount of physical memory and the allocated system memory limit. In any case, if an application creates enough threads or processes to approach these limits, its developer should rethink the design of the application, since there are always ways to achieve the same result with a reasonable number of processes. For example, the main goal when scaling an application is to keep the number of running threads equal to the number of CPUs, and one way to achieve this is to switch from synchronous to asynchronous I/O using completion ports, which helps keep the number of running threads in line with the number of CPUs.

It is no secret that the development of almost any automated system begins with defining the format of its input and output data. Data can vary significantly in structure and organization: some has multiple links, some is just an array of simple types.

We are primarily interested in two approaches to storing and working with data: SQL and NoSQL.

SQL (Structured Query Language) is a structured query language used to create, modify and manipulate data in relational databases built on the relational data model. I will not dwell on the approach of the same name in detail, since it is the first thing anyone encounters when studying databases.

NoSQL (not only SQL) is a number of approaches aimed at implementing database models that differ significantly from the tools of the SQL language characteristic of traditional relational databases.

The term NoSQL was coined by Eric Evans when Johan Oskarsson of Last.fm wanted to organize an event to discuss open source distributed databases.

The NoSQL concept is not a complete rejection of the SQL language and the relational model. NoSQL is an important and useful tool, but not a universal one. One of the problems of classic relational databases is the difficulty of working with very large data sets and in highly loaded systems. The main goal of NoSQL is to extend the capabilities of databases where SQL is not flexible enough or does not provide the required performance, not to displace it where it meets the requirements of a particular task.

In July 2011, Couchbase, the developer of CouchDB, Memcached and Membase, announced the creation of a new SQL-like query language, UnQL (Unstructured Data Query Language). The work on the new language was done by SQLite creator Richard Hipp and CouchDB project founder Damien Katz, and the development was handed over to the community under public domain terms.

The NoSQL approach will be useful to us for storing huge arrays of simple, unstructured information that does not need to be linked to other data. An example of such information is a multi-million-entry list of cache entries or image files. In such cases we get a significant performance gain compared to the relational approach.

NoSQL systems

NoSQL DBMS

Let's define the concepts.

Scalability - automatic distribution of data across multiple servers. We call such systems distributed databases. These include Cassandra, HBase, Riak, Scalaris, and Voldemort. If you are using a volume of data that cannot be handled on a single machine, or if you do not want to manually manage the distribution, then this is what you need.

You should pay attention to the following things: support for multiple data centers and the ability to add new machines to a running cluster transparently for your applications.

Criteria to compare (flattened table): transparent addition of a machine to a running cluster; support for multiple data centers; whether the system still needs manual tweaking.

Non-distributed databases include CouchDB, MongoDB, Neo4j, Redis and Tokyo Cabinet. These systems can be used as building blocks ("layers") of distributed systems.

Data and Query Model

There are a huge number of data models and query APIs in NoSQL databases.

Data model / query API (flattened per-system table): the data models listed were column family, documents, collections and key/value; the query APIs included nested hashes and REST.

The column family system is used in Cassandra and HBase. On both systems, you have rows and columns, but the number of rows is not large: each row has a variable number of columns and the columns do not need to be predefined.

The key / value system is simple and not difficult to implement, but not efficient if you are interested in querying or updating only a portion of the data. It is also difficult to implement complex structures on top of this type of distributed system.

Document databases are essentially the next layer of key / value systems, allowing you to associate nested data with a key. Supporting such queries is more efficient than just returning the entire value.

Neo4J has a unique data model that describes objects in the form of nodes and edges of a graph. For queries that fit this model (for example, hierarchical data), performance can be orders of magnitude higher than for alternatives.

Scalaris is unique in its use of distributed transactions across multiple keys.

Storage System

This is the view in which data is presented in the system.

Storage type (flattened per-system table): Memtable/SSTable; append-only B-Tree; Memtable/SSTable on HDFS; on-disk linked list; in-memory with background snapshots; pluggable (primarily BDB, MySQL).

The storage system can help you estimate workloads.

Systems that keep data in memory are very, very fast, but they cannot handle data larger than the available RAM, and preserving that data across a crash or power outage can be a problem; the amount of data that has to be written out can be very large. Some systems, such as Scalaris, solve this problem with replication, but Scalaris does not support scaling across multiple data centers.

Memtable/SSTable systems buffer write requests in memory (the Memtable) after first appending them to a log. Once enough records have accumulated, the Memtable is sorted and written to disk as an SSTable. This gives performance close to that of RAM while avoiding the problems inherent in purely in-memory storage.

B-trees have been used in databases for a very long time. They provide solid indexing support, but performance is very poor on machines with magnetic hard drives.
An interesting variation is used in CouchDB: append-only B-Trees, which give good performance when writing data to disk. This is achieved by the fact that the tree does not have to be rebuilt in place when an element is added.

The Memcached project deserves separate consideration, as it became the progenitor of many other systems.

Memcached - software for caching data in the server's RAM based on the hash table paradigm. It is a high-performance distributed in-memory object caching system designed for high-load Internet systems.

A hash table is a data structure that implements an associative array interface that allows you to store pairs (key, value) and perform three operations: an operation to add a new pair, a search operation, and an operation to delete a pair by key.

The Memcached client library lets you cache data in the RAM of many available servers. Distribution is implemented by sharding the data by the hash of the key, analogous to the buckets of a hash table: the client library computes a hash from the key and uses it to pick the appropriate server. A server failure is treated as a cache miss, so the fault tolerance of the setup can be raised by increasing the number of memcached servers and by being able to hot-swap them.
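A quick way to see this key/value interface in action is to talk to a memcached instance directly over its text protocol (assuming one is running on localhost on the default port 11211):

printf 'set greeting 0 900 5\r\nhello\r\nget greeting\r\nquit\r\n' | nc 127.0.0.1 11211

Here "greeting" is stored for 900 seconds with a 5-byte value; in a real setup the client library issues these commands for you and picks the server by the key's hash.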

Memcached has many modifications, developed in several projects: Mycached, membase, Memcache, MemcacheQ, Memcacheddb and libMemcached .

The central project, the locomotive of the NoSQL concept, is Memcached. One of its most significant drawbacks is that the cache itself is a highly unreliable place to store data. Additional solutions are designed to eliminate this shortcoming: MemcachedDB and Membase. At the interface (API) level these products are fully compatible with Memcached, but here data is flushed to disk when it expires (a db checkpoint strategy). As such, they are a prime example of "non-relational databases": persistent (long-term) distributed storage systems for key-value pairs.

The next Memcached-based product is MemcacheQ. It is a message queueing system that uses a very simplified Memcached API. MemcacheQ provides named queues into which you can write messages, while the data itself is physically stored in a BerkeleyDB database (as in MemcachedDB), so durability, replication and so on are taken care of.

LibMemcached is a well-known client library written in C ++ for working with the already standard Memcached protocol.

All non-relational stores built as distributed systems of key-value pairs can be divided into two types: persistent and non-persistent. Persistent ones (e.g. MemcachedDB, Membase, Hypertable) write data to disk, ensuring it survives a failure. Non-persistent ones (classic Memcached) keep keys in volatile memory and do not guarantee their safety. It is reasonable to use non-persistent stores for caching and for reducing the load on persistent ones - that is their inseparable link and their main advantage.

Persistent storages are already full-fledged NoSQL databases that combine the speed of Memcached and allow you to store more complex data.

Frontend + backend scheme

The most common scheme, in which a fast and lightweight web server (for example, Nginx) acts as a frontend, and Apache works as a backend.

Let's take a look at the advantages of such a scheme with an example. Imagine that the Apache web server needs to serve about 1000 requests simultaneously, and many of these clients are connected to the Internet over slow communication channels. In the case of using Apache, we will get 1000 httpd processes, each of which will be allocated RAM, and this memory will be occupied until the client receives the requested content or an error occurs.

With the frontend + backend scheme, when a client request arrives, Nginx passes it to Apache and quickly receives the response; for static content (html, images, cache files, etc.) Nginx builds the response itself without disturbing Apache. If logic does need to be executed (a php script), Apache runs it, hands the response to Nginx and frees its memory, and from then on the client is served by Nginx - a web server designed precisely to deliver static content to a large number of clients with low resource consumption. Coupled with competent caching, this gives a tremendous saving of server resources and a system that can rightfully be called highly loaded.

Workhorses

Apache - performance optimization

With the frontend + backend scheme the question of Apache performance is not so acute, but if every microsecond of processor time and every byte of RAM is dear to you, you should pay attention to it.

The surest way to increase server performance is to install a faster processor (or several) and more memory, but to begin with we will take a less radical path: speeding up Apache by optimizing its configuration. Some settings can only be applied by rebuilding Apache, others can be applied without recompiling the server.

Load only required modules

Most of Apache's functionality is implemented in modules. These modules can either be compiled into a particular build or loaded as dynamic libraries (DSO). Most modern distributions ship Apache with a set of DSOs, so unnecessary modules can be disabled without recompiling.

By reducing the number of modules, you will reduce the amount of memory consumed. If you decide to compile Apache yourself, then either carefully choose the list of modules you want to include, or compile them as DSOs using apxs in Apache1 and apxs2 in Apache2. To disable unnecessary DSO modules, just comment out the extra LoadModule lines in httpd.conf. If you compile modules statically, Apache will consume a little less memory, but you have to recompile it each time to disable or enable a particular module.
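For example, a fragment of httpd.conf might look like this; the module names and paths are the stock ones from a typical distribution and serve only as an illustration of which modules you might not need:

#LoadModule autoindex_module modules/mod_autoindex.so
#LoadModule status_module modules/mod_status.so
#LoadModule userdir_module modules/mod_userdir.so
LoadModule rewrite_module modules/mod_rewrite.so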

Choose the right MPM

Apache allocates its own process or thread to handle each request. These processes operate according to one of the MPMs (Multi-Processing Modules). The choice of MPM depends on many factors, such as thread support in the OS, the amount of free memory, and stability and security requirements.

For security over performance, choose peruser or Apache-itk. If performance is more important, take a look at prefork or worker.

worker (Apache Software Foundation). Hybrid multi-process, multi-threaded model. While keeping the stability of multi-process solutions, it can serve a large number of clients with minimal resource use. Purpose: medium-loaded web servers. Status: stable.

prefork (Apache Software Foundation). MPM based on pre-created separate processes, without using threads. Purpose: greater security and stability thanks to the isolation of processes from one another, and compatibility with old libraries that do not support threads. Status: stable.

perchild (Apache Software Foundation). A hybrid model with a fixed number of processes. Purpose: highly loaded servers, with the ability to run child processes under a different user name for better security. Status: in development, unstable.

mpm_netware (Apache Software Foundation). Multi-threaded model optimized for the NetWare environment. Purpose: Novell NetWare servers. Status: stable.

mpm_winnt (Apache Software Foundation, Microsoft Windows). Multi-threaded model created for the Microsoft Windows operating system. Purpose: servers running Windows Server. Status: stable.

Apache-ITK (Steinar H. Gunderson). MPM based on the prefork model; allows each virtual host to be run under its own uid and gid. Purpose: hosting servers and servers where user isolation and resource accounting are critical. Status: stable.

peruser (Sean Gabriel Heacock). Model based on the perchild MPM; allows each virtual host to be run under its own uid and gid and does not use threads. Purpose: enhanced security, working with libraries that do not support threads.

To change MPM, you need to recompile Apache. For this it is more convenient to take a source-based distribution kit.

DNS lookup

The HostnameLookups directive enables reverse DNS lookups, so that client host names are written to the logs instead of IP addresses. This significantly slows down request processing, since a request is not handled until a response has arrived from the DNS server. Make sure this directive is always off (HostnameLookups Off). If DNS names are needed, you can post-process the log with the logresolve utility.
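For example, to resolve addresses offline after log rotation (the log paths here are an assumption - use your own):

logresolve < /var/log/apache2/access.log > /var/log/apache2/access_resolved.log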

In addition, make sure that the Allow from and Deny from directives use IP addresses and not domain names. Otherwise Apache will make two DNS requests (a reverse and a forward one) to find out the IP and make sure the client is valid.
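For example, in Apache 2.2 syntax (the network below is an illustration, not a recommendation):

Order Deny,Allow
Deny from all
Allow from 192.168.0.0/24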

AllowOverride

If the AllowOverride directive is not set to None, Apache will try to open .htaccess files in every directory it visits and in all directories above it. For instance:

DocumentRoot /var/www/html
<Directory />
    AllowOverride All
</Directory>

If /index.html is requested, Apache will try to open (and interpret) the files /.htaccess, /var/.htaccess, /var/www/.htaccess and /var/www/html/.htaccess. Obviously, this increases the processing time of the request. So if you only need .htaccess for one directory, allow it only for that directory:

DocumentRoot /var/www/html
<Directory />
    AllowOverride None
</Directory>
<Directory /var/www/html/>
    AllowOverride All
</Directory>

FollowSymLinks and SymLinksIfOwnerMatch

If the FollowSymLinks option is enabled for a directory, Apache will follow symbolic links in that directory. If SymLinksIfOwnerMatch is enabled, Apache will follow a symbolic link only if the owner of the file or directory it points to matches the owner of the directory in question. With SymLinksIfOwnerMatch enabled, Apache therefore makes more system calls; additional system calls are also needed when FollowSymLinks is not set at all. So the best option for performance is to enable FollowSymLinks - provided, of course, that the security policy allows it.

Content Negotiation

This is a mechanism, defined in the HTTP specification, for serving different versions of a document (representations of a resource) at the same URI, so that the client can get the version that best suits its capabilities. When the client sends a request, it tells the server which content types it understands, each with a rating of how well it understands it, and the server can then provide the version of the resource that best matches the client's needs.

It is not hard to see how this can affect server performance, so avoid using content negotiation.

MaxClients

The MaxClients directive sets the maximum number of parallel requests the server will handle. MaxClients must not be too small, or many clients will be turned away, and not too large, or the server will run out of resources and "crash". Roughly, MaxClients = the amount of memory allocated to the web server / the maximum size of a spawned process or thread. For static files Apache uses about 2-3 MB per process; for dynamic content (php, cgi) it depends on the script, but is usually around 16-32 MB. If the server is already handling MaxClients requests, new requests are queued; the queue size is set by the ListenBacklog directive.
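As a worked example (the numbers are an assumption, not a recommendation): if 512 MB are set aside for the web server and a dynamic request takes about 25 MB, then 512 / 25 ≈ 20, so:

MaxClients 20
ListenBacklog 100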

MinSpareServers, MaxSpareServers, and StartServers

Creating a thread, and especially a process, is a resource-intensive operation, so Apache creates them in advance. The MaxSpareServers and MinSpareServers directives set the maximum and minimum number of processes/threads that should be standing by, ready to accept a request. If MinSpareServers is too small and a burst of requests arrives, Apache will have to spawn many new processes/threads, creating extra load at the peak moment. If MaxSpareServers is too large, Apache will burden the system with surplus processes even when the number of requests is minimal.

Empirically, you need to select such values \u200b\u200bso that Apache does not create more than 4 processes / threads per second. If it creates more than 4, a corresponding entry will be made in the ErrorLog - a signal that MinSpareServers should be increased.

MaxRequestsPerChild

The MaxRequestsPerChild directive defines the maximum number of requests that one child process / thread can handle before it exits. By default, the value is set to 0, which means it will never complete. It is recommended to set MaxRequestsPerChild equal to the number of requests per hour. This will not create unnecessary load on the server and, at the same time, will help to get rid of problems with memory leaks in child processes (for example, if you are using an unstable version of php).

KeepAlive and KeepAliveTimeout

KeepAlive allows multiple requests to be made on a single TCP connection. When using the frontend + backend scheme, these directives are not relevant.

HTTP compression

All modern clients and almost all servers now support HTTP compression. Using compression reduces the traffic between client and server by up to a factor of four, at the cost of extra load on the server's CPU. But if the server is visited by many clients on slow channels, compression can actually reduce the load: the compressed response is transmitted faster, the resources occupied by the child process are released sooner, and the number of concurrent requests drops. This is especially noticeable when memory is constrained.

Note that you should not set the gzip compression level higher than 5: the CPU load grows significantly while the compression ratio improves only slightly. Also, do not compress files whose format already implies compression - that is almost all multimedia files and archives.
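
A minimal mod_deflate sketch reflecting these recommendations (the MIME types and the level are illustrative):

    <IfModule mod_deflate.c>
        # Compress only text formats; images and archives are already compressed
        AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml application/javascript
        # A moderate level: higher values cost CPU for little extra gain
        DeflateCompressionLevel 5
    </IfModule>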

Client side caching

Don't forget to set Expires headers for static files (mod_expires module). If the file does not change, you should always instruct the client to cache it. In this case, the client will load pages faster, and the server will get rid of unnecessary requests.
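
A minimal mod_expires sketch (the lifetimes are illustrative and should match how often the files actually change):

    <IfModule mod_expires.c>
        ExpiresActive On
        ExpiresByType image/png              "access plus 1 month"
        ExpiresByType image/jpeg             "access plus 1 month"
        ExpiresByType text/css               "access plus 1 week"
        ExpiresByType application/javascript "access plus 1 week"
    </IfModule>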

Disable logs

Disabling logs helps to cope temporarily with the load at peak times. This measure noticeably reduces disk activity and works as a universal stopgap in a critical situation. Naturally, given its obvious drawbacks, it cannot be recommended for regular use and serves only as a temporary solution to the problem.
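
One possible way to do this temporarily, assuming the standard combined access log (the log path is illustrative):

    # Temporary measure for peak load: stop writing the access log
    # CustomLog /var/log/apache2/access.log combined
    CustomLog /dev/null combined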

Nginx

A simple and lightweight web server designed specifically to handle static requests. The reason for its performance is that worker processes serve multiple connections concurrently, multiplexing them with the select, epoll (Linux) and kqueue (FreeBSD) system calls. The server has an efficient pooled memory management system. The response to the client is formed in buffers that either hold data in memory or point to a section of a file. The buffers are chained together to define the sequence in which data will be sent to the client. If the operating system supports efficient I/O operations such as writev and sendfile, Nginx applies them whenever possible.

When used in conjunction with Apache, Nginx is configured to serve static content and to act as a load balancer. Most of the time it is busy serving only static content, doing so very quickly and with minimal overhead.
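
A minimal sketch of such a frontend configuration (the domain, document root and backend address are illustrative):

    server {
        listen      80;
        server_name example.com;
        root        /var/www/example.com;

        # Static files are served by Nginx itself
        location ~* \.(jpe?g|png|gif|css|js|ico)$ {
            expires 30d;
        }

        # Everything else is proxied to the Apache backend
        location / {
            proxy_pass       http://127.0.0.1:8080;
            proxy_set_header Host      $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }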

Lighttpd

"A web server designed for speed, security, and standards compliance." - wikipedia

It is an alternative to Nginx and is used for the same purposes.

PHP accelerators

The principle of operation of such products is that they cache the compiled script bytecode, so the PHP interpreter does not have to re-parse and recompile scripts on every request, which reduces its load.

Existing solutions

The Alternative PHP Cache (APC) was conceived as a free, open-source and stable framework for caching and optimizing PHP source code. Supports PHP4 and PHP5, including 5.3.

eAccelerator is a free open source project that also serves as an accelerator, optimizer and unpacker. Has built-in dynamic content caching functionality. It is possible to optimize PHP scripts. Supports PHP4 and PHP5 including 5.3.

PhpExpress is a free accelerator for processing PHP scripts on a web server. It also supports loading files encoded with Nu-Coder. Supports PHP4 and PHP5, including 5.3.

XCache supports PHP4 and PHP5 including 5.3. Since version 2.0.0 (release candidate from 2012-04-05) PHP 5.4 support is included.

Windows Cache Extension for PHP - PHP accelerator for Microsoft IIS (BSD License). Supports PHP only (5.2 and 5.3).
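
As an illustration, a hedged php.ini fragment enabling APC, one of the accelerators above (the key names follow APC's documentation, while the values are purely indicative and depend on the installed version):

    ; Load and enable APC, give it 64 MB of shared memory for the bytecode cache
    extension    = apc.so
    apc.enabled  = 1
    apc.shm_size = 64M
    apc.ttl      = 3600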

Caching logic

"Cache, cache and cache again!" - this is the motto of a highly loaded system.

Let's imagine an ideal high-load site. The server receives an HTTP request from the client. The frontend matches the request against a physical file on the server and, if one exists, serves it. We will leave aside the loading of scripts and images, since these are mostly static and are served according to the same principle. If the file does not physically exist, the frontend passes the request to the backend, which handles the logic (php scripts, etc.). The backend decides whether to cache this request and, if so, creates a file in a predefined place which the frontend will serve directly in the future. Thus, we have permanently cached this request, and the server will handle it as quickly as possible with the lowest possible load.

This ideal example suits pages whose content never changes, or changes rarely. In practice we have pages whose content, or at least part of it, can change with every request. An example of such content is user-specific information, which should change with a delay imperceptible to the user, or be displayed in real time (updated with each page reload). Here the task boils down to separating dynamic and static data on the page.

The most convenient and common way to split the data is to divide the page into blocks. This is logical and convenient, because from the layout point of view the page consists of blocks. Naturally, some logic cannot be avoided in this case, but that logic will be processed at minimal cost.

Thus, the client's request (except for requests for static files) is passed to the backend, and its processing comes down to the following actions (a PHP sketch of the caching step follows the list):

  1. Getting information about the blocks that will be on the page.
  2. Checking cache information for each block. The cache may not exist or need to be updated. In this case, we generate a cache file. If the block should not be cached, we execute the appropriate logic. Cache information can be stored in a nosql database or in a file structure. There is only one requirement: obtaining this information should take a minimum of resources.
  3. We form the HTML page. Cached blocks are embedded using SSI instructions (a link to the cache file is inserted), which significantly saves memory.
  4. The page goes to the frontend, which replaces all ssi instructions with the contents of the files and gives the page to the client.
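
A minimal PHP sketch of steps 2-3, assuming cache files live under /var/www/cache (exposed to the frontend as /cache/) and SSI is enabled on the frontend; the function name and paths are hypothetical:

    <?php
    // Returns either an SSI include pointing at the block's cache file
    // or, for non-cacheable blocks, the freshly rendered HTML.
    function render_block($name, $ttl, $render)
    {
        $cacheFile = "/var/www/cache/{$name}.html";   // hypothetical cache location

        if ($ttl > 0) {
            // (Re)generate the cache file when it is missing or stale
            if (!is_file($cacheFile) || time() - filemtime($cacheFile) > $ttl) {
                file_put_contents($cacheFile, call_user_func($render), LOCK_EX);
            }
            // The frontend's SSI processing will substitute the file contents into the page
            return '<!--#include virtual="/cache/' . $name . '.html" -->';
        }

        // The block must not be cached: execute its logic on every request
        return call_user_func($render);
    }

    // The "news" block is rebuilt at most once every 10 minutes
    echo render_block('news', 600, function () {
        return '<div>...heavy query and rendering...</div>';
    });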

It is also common to cache the results of a function or class method. At the same time, we pass the caching function a reference to the object (if we call a method), the name of the method or function (if it is a global function) and the parameters intended for this method or function. The caching function will check for the existence of a cache file, generate or read it if necessary, and then return the result.
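
A hedged sketch of such a caching wrapper in PHP (the cache directory and TTL are arbitrary; get_popular_tags, $blog and latestPosts are hypothetical names used only in the usage example):

    <?php
    // Caches the result of a function or method call on disk.
    function cached_call($callable, array $params = array(), $ttl = 300)
    {
        // Build a readable cache key: Class::method or function name plus parameters
        $name = is_array($callable)
            ? get_class($callable[0]) . '::' . $callable[1]
            : $callable;
        $key  = md5($name . serialize($params));
        $file = "/var/www/cache/calls/{$key}.dat";     // hypothetical cache directory

        if (is_file($file) && time() - filemtime($file) <= $ttl) {
            return unserialize(file_get_contents($file));      // cache hit
        }

        $result = call_user_func_array($callable, $params);    // cache miss: run the real code
        file_put_contents($file, serialize($result), LOCK_EX);
        return $result;
    }

    // Usage: a global function and an object method
    $tags  = cached_call('get_popular_tags', array(10));
    $posts = cached_call(array($blog, 'latestPosts'), array(5), 60);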

This is a general description of how a high-load site works. A specific implementation will differ in details, but the concept remains the same.

Pictures, pics, thumbnails

It turns out that images can also be cached. Why, you ask? After uploading to the server, we already have a file that the frontend will quickly serve when needed. But often we need another image derived from an existing one (for example, in a different size) - say, a thumbnail. In that case, we just need to generate the path to the thumbnail file (which may not exist yet) and serve the page to the client. The flow then looks like this (a sketch of the frontend routing follows the list):

  1. The client, having received the source code of the page, starts loading statics and makes a request for a still-nonexistent picture to the frontend.
  2. Frontend redirects requests for non-existent images to the backend.
  3. The backend parses the request, generates the image file, saves it where the frontend expects it, and sends the binary data with the appropriate HTTP header.
  4. All subsequent requests for this image are served directly by the frontend.
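
A minimal sketch of how the frontend part of this flow can be expressed in the Nginx configuration (the location and the backend address are illustrative):

    # Serve the thumbnail directly if it already exists on disk,
    # otherwise pass the request to the backend that will generate and save it
    location /thumbs/ {
        try_files $uri @resize;
    }
    location @resize {
        proxy_pass http://127.0.0.1:8080;
    }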