The essence and structure of databases. Relational databases

A data model is a collection of data structures and operations for their processing. Using the data model, you can visualize the structure of objects and the relationships established between them. The terminology of data models is characterized by the concepts of "data element" and "binding rules". A data item describes any set of data, and binding rules define the algorithms for the relationship of data items. To date, many different data models have been developed, but in practice, three main ones are used. Allocate hierarchical, network and relational data models. Accordingly, they talk about hierarchical, network and relational DBMS.

О Hierarchical data model. Hierarchically organized data is very common in everyday life. For example, the structure of a higher education institution is a multilevel hierarchical structure. A hierarchical (tree-like) database consists of an ordered set of elements. In this model, the original elements generate other elements, and these elements in turn generate the following elements. Each child has only one parent.

Organizational charts, lists of materials, table of contents in books, project plans, and many other collections of data can be presented in a hierarchical fashion. Integrity of links between ancestors and descendants is automatically maintained. As a general rule, no descendant can exist without its parent.

The main disadvantage of this model is the need to use the hierarchy that was laid in the basis of the database during design. The need for constant data reorganization (and often the impossibility of this reorganization) led to the creation of a more general model - network.

About Network Data Model. The networked approach to organizing data is an extension of the hierarchical approach. This model differs from the hierachic one in that each generated element can have more than one parent element. ■

Since a network database can directly represent all types of relationships inherent in the data of the corresponding organization, this data can be navigated, explored and queried in all sorts of ways, that is, the network model is not connected by just one hierarchy. However, in order to compose a query to a networked database, it is necessary to delve deeply into its structure (to have a schema of this database at hand) and develop a mechanism for navigating the database, which is a significant drawback of this database model.

About the relational data model. The basic idea behind a relational data model is to represent any dataset as a two-dimensional table. In the simplest case, a relational model describes a single two-dimensional table, but most often this model describes the structure and relationships between several different tables.

Relational data model

So, the purpose of the information system is to process dataabout objectsthe real world, given connectionsbetween objects. In DB theory, data is often called attributes, andobjects - entities.Object, attribute and connection are fundamental concepts of I.S.

An object(or entity) is something that exists and discernible,that is, an object can be called that "something" for which there is a name and a way to distinguish one similar object from another. For example, every school is an object. Objects are also a person, a class at a school, a firm, an alloy, a chemical compound, etc. Objects can be not only material objects, but also more abstract concepts that reflect the real world. For example, events, regions, works of art; books (not as printed products, but as works), theatrical performances, films; legal norms, philosophical theories, etc.

Attribute(or given)- this is some indicator that characterizes an object and takes a certain numerical, textual or other value for a specific instance of the object. The information system operates with a set of objects designed in relation to a given subject area, using specific attribute values(data) of certain objects. For example, let's take classes in a school as a set of objects. The number of students in a class is a given, which takes a numerical value (one class has 28, another has 32). The name of the class is a given one that takes a textual meaning (one has 10A, the other has 9B, etc.).

The development of relational databases began in the late 1960s when the first papers appeared discussing; the possibility of using familiar and natural ways of presenting data in the design of databases - the so-called tabular datalogical models.

The founder of the theory of relational databases is considered to be an employee of IBM, Dr. E. Codd, who published an article on 6 (June 1970) A Relational Model of Data for Large-Shared Data Banks(A relational data model for large collective data banks). This article first used the term "relational data model." The theory of relational databases, developed in the 70s in the United States by Dr. E. Codd, has a powerful mathematical foundation that describes the rules for efficiently organizing data. The theoretical basis developed by E. Codd became the basis for the development of the theory of database design.

E. Codd, being a mathematician by education, suggested using the apparatus of set theory (union, intersection, difference, Cartesian product) for data processing. He proved that any set of data can be represented in the form of two-dimensional tables of a special kind, known in mathematics as "relations."

Relationalsuch a database is considered in which all data is presented to the user in the form of rectangular tables of data values, and all operations on the database are reduced to manipulating tables.

The table consists of columns (fields)and lines (records);has a name that is unique within the database. Tablereflects object typethe real world (essence),and each of her string is a specific object.Each column in a table is a collection of values \u200b\u200bfor a particular attribute of an object. The values \u200b\u200bare selected from the set of all possible values \u200b\u200bof the object attribute, which is called domain.

In its most general form, a domain is defined by specifying some basic data type, to which the elements of the domain belong, and an arbitrary logical expression applied to the data elements. If, when evaluating a logical condition on a data item, the result is "true", then this item belongs to the domain. In the simplest case, a domain is defined as a valid potential set of values \u200b\u200bof the same type. For example, the set of dates of birth of all employees constitutes the "date of birth domain" and the names of all employees constitute the "employee name domain". The date of birth domain has a datatype to store information about points in time, and the employee name domain must be a character datatype.

If two values \u200b\u200bcome from the same domain, then you can compare the two values. For example, if two values \u200b\u200bare from the date of birth domain, you can compare them to determine which employee is older. If the values \u200b\u200bare taken from different domains, then their comparison is not allowed, since, in all likelihood, it does not make sense. For example, nothing definite will come out of comparing the name and date of birth of an employee.

Each column (field) has a name, which is usually written at the top of the table. When designing tables within a specific DBMS, it is possible to select for each field its a type,that is, to define a set of rules for displaying it, as well as to determine those operations that can be performed on the data stored in this field. The sets of types can differ for different DBMS.

The field name must be unique in the table, however different tables can have fields with the same name. Any table must have at least one field; the fields are arranged in the table in accordance with the order of their names when it was created. Unlike fields, strings do not have names; their order in the table is not defined, and the number is not logically limited.

Since the rows in the table are not ordered, it is impossible to select a row by its position - there is no “first”, “second”, “last” among them. Any table has one or more columns, the values \u200b\u200bin which uniquely identify each of its rows. Such a column (or a combination of columns) is called primary key... An artificial field is often introduced to number records in a table. Such a field, for example, can be its ordinal, which can ensure the uniqueness of each record in the table. The key must have the following properties.

Uniqueness.At any moment in time, no two different tuples of the relation have the same value for the combination of the attributes included in the key. That is, the table cannot contain two lines with the same identification number or passport number.

Minimality.None of the attributes included in the key can be excluded from the key without violating the uniqueness. This means that it is not necessary to create a key that includes both the passport number and the identification number. It is sufficient to use any of these attributes to uniquely identify the tuple. You should also not include a non-unique attribute in the key, that is, it is prohibited to use a combination of identification number and employee name as a key. By excluding the employee name from the key, you can still uniquely identify each row.

Each relation has at least one possible key, since the totality of all its attributes satisfies the uniqueness condition - this follows from the very definition of the relation.

One of the possible keys is randomly selected in as the primary key.Other possible keys, if any, are taken as alternative keys.For example, if you select an identification number as the primary key, then the passport number will be an alternative key.

The relationship of tables is an essential element of the relational data model. She is supported foreign keys.

When describing a relational database model, different terms are often used for the same concept, depending on the level of description (theory or practice) and the system (Access, SQL Server, dBase). Table 2.3 summarizes the terms used.

Table 2.3.Database terminology

Database Theory ____________ Relational Databases _________ SQL Server __________

Relation Table Table

Tuple Record Row

Attribute (Field) _______________ Column or column

Relational databases

Relational databaseis a set of relations containing all the information that must be stored in the database. That is, the database represents the set of tables required to store all of the data. Relational database tables are logically related to each other. Requirements for the design of a relational database in general can be reduced to several rules.

О Each table has a unique name in the database and consists of rows of the same type.

О Each table consists of a fixed number of columns and values. More than one value cannot be stored in one column of a row. For example, if there is a table with information about the author, publication date, circulation, etc., then the column with the author's name cannot contain more than one surname. If the book is written by two or more authors, you will have to use additional tables.

О At no point in time will there be two rows in the table that duplicate each other. Rows must differ by at least one value to be able to uniquely identify any row in the table.

О Each column is assigned a name that is unique within the table; a specific data type is set for it so that homogeneous values \u200b\u200b(dates, surnames, phone numbers, sums of money, etc.) are placed in this column.

О The complete informational content of the database is represented in the form of explicit values \u200b\u200bof the data itself, and this is the only method of presentation. For example, the relationship between tables is based on the data stored in the corresponding columns, and not on the basis of any pointers that artificially define relationships.

О When processing data, you can freely refer to any row or column of the table. The values \u200b\u200bstored in the table do not impose any restrictions on the order in which the data is accessed. Column description,

RELATIONAL DATABASE AND ITS FEATURES. TYPES OF RELATIONS BETWEEN RELATIONAL TABLES

Relational database is a collection of interconnected tables, each of which contains information about objects of a certain type. A table row contains data about one object (for example, a product, a customer), and the table columns describe various characteristics of these objects - attributes (for example, name, product code, customer information). Records, that is, rows of a table, have the same structure - they consist of fields that store the attributes of an object. Each field, i.e., column, describes only one characteristic of the object and has a strictly defined data type. All records have the same fields, only they display different informational properties of the object.

In a relational database, each table must have a primary key — a field or combination of fields that uniquely identifies each row in the table. If a key consists of several fields, it is called a composite key. The key must be unique and uniquely identify the record. The key value can be used to find a single record. Keys are also used to organize information in the database.

Relational database tables must meet the relationship normalization requirements. Normalization of relations is a formal apparatus of constraints on the formation of tables, which eliminates duplication, ensures the consistency of those stored in the database, and reduces the labor costs of maintaining the database.

Let the Student table be created, containing the following fields: group number, full name, student student number, date of birth, specialty shuffling, faculty name. Such an organization of information storage will have a number of disadvantages:

duplication of information (the name of the specialty and faculty is repeated for each student), therefore, the volume of the database will increase;
the procedure for updating information in the table is difficult due to the need to edit eachtable records.

Table normalization is designed to address these shortcomings. There is three normal forms of relationship.

First normal form. A relational table is reduced to first normal form if and only if none of its rows contains more than one value in any of its fields and none of its key fields is empty. So, if from the Student table it is required to obtain information by the student's name, then the Full name field should be divided into the Surname, Name, Patronymic parts.

Second normal form... A relational table is specified in second normal form if it meets the requirements of first normal form and all of its fields that are not included in the primary key are fully functionally related to the primary key. To bring the table to the second normal form, it is necessary to determine the functional dependence of the fields. The functional dependence of fields is a dependence, when in the instance of the information object only one value of the descriptive attribute corresponds to a certain value of the key attribute.

Third normal form. A table is in third normal form if it satisfies the requirements of second normal form, none of its non-key fields is functionally dependent on any other non-key field. For example, in the Student table (group number, full name, grade book number, date of birth, starosta) three fields - gradebook number, group number, starosta are in a transitive relationship. The group number depends on the grade book number, and the Headman depends on the group number. To eliminate the transitive dependence, it is necessary to transfer part of the fields of the Student table to another table, Group. The tables will take the following form: Student (group number, full name, grade book number, date of birth), group (group number, headman).

The following operations are possible on relational tables:

Concatenation of tables with the same structure. The result is a common table: first the first, then the second (concatenation).
Intersection of tables with the same structure. Result - those records are selected that are in both tables.
Subtraction of tables with the same structure. Result - those records are selected that are not in the subtracted.
Sample (horizontal subset). Result - records are selected that meet certain conditions.
Projection (vertical subset). The result is a relation that contains some of the fields from the original tables.
Cartesian Product of Two Tables The records of the resulting table are obtained by combining each record of the first table with each record of the other table.

Relational tables can be related to each other, therefore, data can be retrieved from multiple tables at the same time. The tables are linked together in order to ultimately reduce the size of the database. The relationship of each pair of tables is provided if they have the same columns.

There are the following types of information links:

one to one;
one-to-many;
many-to-many.

One-to-one communication assumes that one attribute of the first table matches only one attribute of the second table, and vice versa.

One-to-many relationshipassumes that one attribute in the first table corresponds to several attributes in the second table.

Many-to-many relationship assumes that one attribute of the first table corresponds to several attributes of the second table, and vice versa.

Transfer

Translator's note: although the article is quite old (published 2 years ago) and has a loud title, it still gives a good idea of \u200b\u200bthe differences between relational databases and NoSQL databases, their advantages and disadvantages, and also provides a brief overview of non-relational storage.

A lot of non-relational databases have emerged lately. This suggests that if you want nearly unlimited on-demand scalability, you need a non-relational database.

If this is true, does this mean that the mighty relational databases are vulnerable? Does this mean the days of relational databases are over and will soon be over? In this article, we'll take a look at the popular trend of non-relational databases for a variety of situations, and see if this impacts the future of relational databases.

Relational databases have been around for about 30 years. During this time, several revolutions broke out that were supposed to end relational storage. Of course, none of these revolutions took place, and none of them shook the position of relational databases in the least.

Let's start with the basics

A relational database is a collection of tables (entities). Tables are made up of columns and rows (tuples). Constraints can be defined within tables, relationships exist between tables. You can use SQL to run queries that return datasets from one or more tables. Within a single query, data is obtained from several tables by joining them (JOIN), most often the same columns are used for joining that define the relationships between tables. Normalization is the process of structuring a data model to ensure consistency and non-redundancy in the data.

Relational databases are accessed through relational database management systems (RDBMS). Almost all database systems we use are relational such as Oracle, SQL Server, MySQL, Sybase, DB2, TeraData, and so on.

The reasons for this dominance are not obvious. Throughout the history of relational databases, they have consistently offered the best blend of simplicity, robustness, flexibility, performance, scalability, and interoperability in data management.

However, to provide all of these features, relational storage is incredibly complex internally. For example, a simple SELECT query can have hundreds of potential execution paths that the optimizer will evaluate directly at runtime. All of this is hidden from users, but inside the RDBMS it creates an execution plan based on things like costing algorithms and best suited to the query.

Relational database problems

While relational storage provides the best blend of simplicity, robustness, flexibility, performance, scalability, and interoperability, it does not necessarily perform better on each of these than similar systems that focus on one particular feature. This was not a big problem, as the overwhelming dominance of relational DBMSs outweighed any shortcomings. However, if conventional RDBs did not meet the needs, there were always alternatives.

Today the situation is a little different. The variety of applications grows, and with it the importance of the listed features grows. And as the number of databases grows, one feature begins to overshadow all others. It's scalability. As more and more applications run under high load conditions such as web services, their scalability requirements can change very quickly and grow dramatically. The first problem can be very difficult to solve if you have a relational database located on your own server. Suppose the server load has tripled overnight. How quickly can you upgrade the hardware? The solution to the second problem also causes difficulties in the case of using relational databases.

Relational databases scale well only if they are located on a single server. When the resources of this server run out, you will need to add more machines and distribute the load between them. This is where the complexity of relational databases starts to play against scalability. If you try to increase the number of servers not to just a few, but to a hundred or a thousand, the complexity increases by an order of magnitude, and the characteristics that make relational databases so attractive rapidly reduce to zero the chances of using them as a platform for large distributed systems.

To stay competitive, cloud vendors have to somehow deal with this limitation, because what kind of cloud platform is it without scalable data storage. Therefore, vendors have only one option if they want to provide users with scalable storage space. It is necessary to use other types of databases that are more scalable, albeit at the cost of other capabilities available in relational databases.

These advantages, as well as the existing demand for them, have led to a wave of new database management systems.

New wave

This type of database is commonly referred to as a key-value store. In fact, there is no official name, so you can see it in the context of document-oriented, attribute-oriented, distributed databases (although they can also be relational), sharded sorted arrays, distributed hash tables and storage key-value type. While each of these names indicates specific features of the system, they are all variations on a theme that we will call key-value storage.

However, whatever you call it, this "new" type of database is not that new and has always been used mainly for applications for which relational databases would not be suitable. However, without the need for scalability on the web and the cloud, these systems were not in high demand. Now the challenge is to determine which type of storage is more suitable for a particular system.
Relational databases and key-value storages differ fundamentally and are designed to solve different problems. Comparing the characteristics will only allow you to understand the difference between them, but let's start with this:

Storage characteristics

Relational database	Key-value storage
A database is made up of tables, tables are made up of columns and rows, and rows are made up of column values. All rows in one table have the same structure.	For domains, you can draw an analogy with tables, but unlike tables for domains, the data structure is not defined. A domain is a box in which you can put anything you want. Records within the same domain can have different structures.
Data Model 1 is predefined. It is strongly typed and contains constraints and relationships to ensure data integrity.	Records are identified by key, with each record having a dynamic set of attributes associated with it.
The data model is based on the natural representation of the contained data, not on the functionality of the application.	In some implementations, attributes can only be strings. In other implementations, attributes have simple data types that reflect types used in programming: integers, arrays of strings, and lists.
The data model is normalized to avoid duplicate data. Normalization creates relationships between tables. Relationships link data from different tables.	Relationships are not explicitly defined between domains, as well as within one domain.

No joins

Key-value stores are record-oriented. This means that all information related to a given record is stored with it. A domain (which you can think of as a table) can contain countless different records. For example, a domain can contain information about customers and orders. This means that data is usually duplicated between different domains. This is an acceptable approach since disk space is cheap. The main thing is that it allows all related data to be stored in one place, which improves scalability, since there is no need to join data from different tables. When using a relational database, you would need to use joins to group the information you need in one place.

Although the need for a relationship drops dramatically to store key-value pairs, relationships are still needed. Such relationships usually exist between the main entities. For example, an ordering system would have records that contain data about customers, products, and orders. It doesn't matter if this data is in one domain or in several. The bottom line is that when a customer places an order, you probably don't want to keep the customer and order information in one record.
Instead, the order record must contain keys that point to the corresponding customer and product records. Since records can store any information, and relationships are not defined in the data model itself, the database management system will not be able to control the integrity of relationships. This means that you can delete customers and products they ordered. Ensuring data integrity falls entirely on the application.

Data access

Relational database	Key-value storage
Data is created, updated, deleted, and queried using Structured Query Language (SQL).	Data is created, updated, deleted and queried using API method calls.
SQL queries can retrieve data both from a single table and from multiple tables using joins.	Some implementations provide SQL-like syntax for specifying filter conditions.
SQL queries can include aggregations and complex filters.	Often, only basic comparison operators (\u003d,! \u003d,<, >, <= и =>).
A relational database usually contains inline logic such as triggers, stored procedures, and functions.	All business and data integrity logic is contained in the application code.

Interaction with applications

Key-value stores: benefits

There are two distinct advantages of such systems over relational storage.

Suitable for cloud services

The first advantage of key-value stores is that they are simpler and therefore more scalable than relational databases. If you are hosting your own system together, and are planning to host a dozen or a hundred servers that need to cope with an increasing load behind your data warehouse, then key-value stores are your choice.

Because they can be easily and dynamically expanded, they are also useful for vendors who provide a multi-tenant web storage platform. Such a database represents a relatively cheap storage medium with great potential for scalability. Users usually only pay for what they use, however their needs can grow. The vendor will be able to dynamically and practically without restrictions increase the size of the platform, based on the load.

More natural code integration

The relational data model and code object model are usually built differently, leading to some incompatibility. Developers solve this problem by writing code that maps the relational model to the object model. This process does not have a clear and quickly achievable value and can take quite a lot of time that could be spent developing the application itself. In the meantime, many key-value stores store data in a structure that maps to objects more naturally. This can significantly reduce development time.

Other arguments for using key-value stores, such as "Relational databases can get clumsy" (by the way, I have no idea what that means), are less compelling. But before you become a proponent of such repositories, check out the next section.

Key-value stores: disadvantages

Constraints in relational databases ensure data integrity at the lowest level. Data that does not meet the constraints physically cannot get into the database. There are no such restrictions in key-value storages, so data integrity control rests entirely with applications. However, there are bugs in any code. While errors in a well-designed relational database usually do not lead to data integrity problems, errors in key-value stores usually lead to such problems.

Another advantage of relational databases is that they force you through the process of developing a data model. If you have a well-designed model, the database will contain a logical structure that fully reflects the structure of the stored data, but is at odds with the structure of the application. Thus, the data becomes application independent. This means that another application can use the same data and the application logic can be changed without any changes to the base model. To do the same with key-value storage, try replacing the relational model design process with class design, which creates generic classes based on the natural data structure.

And don't forget about compatibility. Unlike relational databases, cloud-based storage has fewer common standards. Although conceptually they are not different, they all have different APIs, request interfaces and their own specifics. Therefore, you are better off trusting your vendor, because if something happens, you cannot easily switch to another service provider. And given the fact that almost all modern key-value stores are in beta 2, trusting becomes even more risky than relational databases.

Limited data analytics

Typically, all cloud storage is built on a multi-lease type, which means that a large number of users and applications use the same system. To prevent "hijacking" of the overall system, vendors usually restrict query execution in some way. For example, in SimpleDB, a query cannot take longer than 5 seconds. Google AppEngine Datastore cannot get more than 1000 records per request 3.

These restrictions are not scary for simple logic (creating, updating, deleting and retrieving a small number of records). But what if your app becomes popular? You've got a lot of new users and a lot of new data, and now you want to make new experiences for users, or somehow benefit from the data. This is where you can go haywire with even simple queries to analyze data. Features like tracking app usage patterns or a recommendation system based on user history can be tricky at best. And at worst, they are simply impossible.

In this case, it is better for analytics to create a separate database that will be filled with data from your key-value storage. Think in advance how this can be done. Will you host the server in the cloud or on your own? Will there be any problems with signal delays between you and your provider? Does your storage support this data transfer? If you have 100 million records, and you can take 1000 records at a time, how much will it take to transfer all the data?

However, don't prioritize scalability. It will be useless if your users decide to use another service, because that provides more options and settings.

Cloud storage

Many web service providers offer multi-tenant key-value stores. Most of them meet the criteria listed above, but each has its own distinctive features and differs from the standards described above. Let's take a look at specific example repositories such as SimpleDB, Google AppEngine Datastore, and SQL Data Services.

Amazon: SimpleDB

SimpleDB is a key-value attribute-oriented store that is part of Amazon WebServices. SimpleDB is in beta; users can use it for free - until their needs exceed a certain limit.

SimpleDB has several limitations. First, the query execution time is limited to 5 seconds. Second, there are no data types other than strings. Everything is stored, retrieved, and compared as a string, so in order to compare dates you will need to convert them to ISO8601 format. Third, the maximum size of any string is 1024 bytes, which limits the size of the text (for example, product description) that you can store as an attribute. However, since the data structure is flexible, you can work around this limitation by adding the attributes "Product Description1", "Product Description2", and so on. But the number of attributes is also limited - a maximum of 256 attributes. While SimpleDB is in beta, the domain size is limited to 10 gigabytes, and the entire database cannot be more than 1 terabyte.

One of the key features of SimpleDB is its use of an eventual consistency model. This model is suitable for multi-threaded work, but keep in mind that after you change the value of an attribute in a record, subsequent read operations may not see these changes. The likelihood of such a development of events is quite low, however, it must be remembered. You don't want to sell the last ticket to five customers just because your data was inconsistent at the time of sale.

Google AppEngine Data Store

Google's AppEngine Datastore is built on top of BigTable, Google's internal structured data storage system. The AppEngine Datastore does not provide direct access to BigTable, but can be thought of as a simplified interface to interact with BigTable.

AppEngine Datastore supports more data types within a single record than SimpleDB. For example, lists, which can contain collections within a record.

Most likely you will use this data store when developing with the Google AppEngine. However, unlike SimpleDB, you cannot use the AppEngine Datastore (or BigTable) outside of Google Web Services.

Microsoft: SQL Data Services

SQL Data Services is part of the Microsoft Azure platform. SQL Data Services is free, in beta and has database size limits. SQL Data Services is a separate application - an add-on over many SQL servers that store data. These stores can be relational, but for you SDS is a key-value store like the products described above.

Non-cloud storage

There are also a number of repositories that you can use outside the cloud by installing them on your own. Almost all of these projects are young, in alpha or beta, and open source. With open source, you may be more aware of potential problems and limitations than with proprietary products.

CouchDB

CouchDB is a free and open source document-oriented database. JSON is used as the data storage format. CouchDB aims to fill the gap between document-oriented and relational databases using "views". Such views contain data from documents in a form similar to tabular, and allow you to build indexes and run queries.

CouchDB is not a truly distributed database at this time. It has replication features to keep data synchronized between servers, but this is not the kind of distribution needed to build a highly scalable environment. However, the CouchDB developers are working on it.

Voldemort project

The Voldemort project is a key-value distributed database designed to scale out across a large number of servers. It was born during the development of LinkedIn and has been used for several systems with high scalability requirements. The Voldemort project also uses a final consistency model.

Mongo

Mongo is a database developed at 10gen by Geir Magnusson and Dwight Merriman (who you may know from DoubleClick). Like CouchDB, Mongo is a document-oriented database that stores data in JSON format. However, Mongo is more of an object base than a pure key-value store.

Drizzle

Drizzle presents a very different approach to solving the problems that key-value stores are designed to deal with. Drizzle started out as a branch of MySQL 6.0. Later, the developers removed a number of functions (including views, triggers, compiled expressions, stored procedures, query cache, ACLs, and part of data types) in order to create a simpler and faster DBMS. However, Drizzle can still be used to store relational data. The goal of the developers is to build a semi-relational platform designed for web and cloud applications running on systems with 16 or more cores.

Decision

Ultimately, there are four reasons why you might choose non-relational key-value storage for your application:

Your data is highly document-oriented and is more suited to a key-value data model than a relational data model.
Your domain model is highly object-oriented, so using key-value storage will reduce the amount of extra code for data transformation.
The data warehouse is cheap and easy to integrate with your vendor's web services.
Your main concern is high scalability on demand.

However, when making your decision, be aware of the limitations of specific databases and the risks that you will encounter by going down the path of using non-relational databases.

For all other requirements, it is better to choose the good old relational DBMS. So are they doomed? Of course not. At least for now.

1 - in my opinion, the term "data structure" is more appropriate here, but left the original data model.
2 - most likely, the author had in mind that in terms of their capabilities, non-relational databases are inferior to relational ones.
3 - the data may be out of date, the article is dated February 2009.

voldemort

drizzle

Add tags

And also examined using the SET NAMES command and the configuration file my.ini. Today there will be a short and, if I may say so, a theoretical article devoted to the question - what are databases and what databases are there.

In this article I will try to summarize what types and types of databases are and I will dwell on some of them in more detail. We'll talk about hierarchical database structure, let's pay attention network database structure, and dwell on relational database structure, consider features of relational databases... And at the end of the article, let's touch on the topic a little. database design, naturally relational, since the MySQL server is essentially a mathematical model of relational databases. Database design and the data types that MySQL server can handle are topics for future posts.

Database. Mathematical models, structure, definition.

Although I'm not going to talk in detail on my blog about mathematical laws and theories describing relational databases, but I must tell the principle of how they work, if you are interested in this topic, then you can always visit a specialized mathematical resource or read the relevant literature, or you can always ask a question in the comments to this publication, and I will try to answer you as far as possible. As I said, the topic of this article is - relational databases... I will try to answer the question, h what is relational databases simple and understandable language. I will touch upon the basic concepts related to relational databases, terminology, the history of the emergence of databases in general and relational in particular.

Types and types of databases

As I already said, there are a lot of types and types of databases and I simply cannot describe them all in this publication, but the most common types of information storage or I will try to describe the types of databases. It is clear that the database stores information about some objects, for example, information about a product in an online store. Any product in the database is an object with some specific parameters and properties. Let's move on to specific examples.

Hierarchical database, structure of hierarchical databases

Hierarchical database - each object with such storage of information is represented in the form of a certain entity, that is, this entity can have child elements, parent elements, and those children can have more children, but there is one object from which everything begins. It turns out a kind of tree. An example of a hierarchical database can be, or the file system of a computer, an example with a file system of a computer I gave when I was looking at the structure of an XML document, in the section Notes on XML.

It should be said that databases of this kind are optimized for reading informationThat is, databases with a hierarchical structure are able to very quickly select the requested information and give it to users. But such a structure does not allow us to go through information as quickly, here you can give an example from life, a computer can easily work with any specific file or folder (which, in fact, are objects of a hierarchical structure), but antivirus scanning of a computer takes a very long time. The second example is the Windows Registry.

In the picture you can see hierarchical database structure, at the very top is parent or root element, below are child elements, elements on the same level are called brothers, well, or neighboring elements. Accordingly, the lower the level of an element, the more nesting this element is.

Network database, network database structure

Network databases, are a kind of modification of hierarchical databases... If you looked closely at the picture above, you probably noticed that only one arrow from the top element goes to each lower element. That is, for hierarchical databases, each child can only have one child. Network databases differ from hierarchical databases in that a child element can have several ancestors, that is, elements above it. For greater clarity and understanding network database structures pay attention to the picture:

It is worth noting that network databases have approximately the same characteristics as hierarchical databases. But, in this heading we are not very interested in hierarchical and network databases, this topic is more related to xML format, and perhaps under the heading on Extensible Markup Language, I will try to take a closer look at this topic. And in the section dedicated to MySQL we are interested in relational databases, on which we will dwell in more detail.

Relational databases, structure of relational databases

Relational databases are very widespread and many are trying to write huge articles dedicated to the issue - why relational databases are so widespread, make thoughtful conclusions and remarks. But in fact, everything is very simple - relational databases are very easy to describe in mathematics, that is, mathematics is very well written for them.

There was once such a mathematician - Edgar Frank Codd, who died in 2003, who described in great detail in the eighties structure of relational databases in mathematical language... And if there is well-written mathematics, then, accordingly, there is a software implementation. Stop at the biography of E.F. I won't codd, there are various encyclopedias for this. It was thanks to Codd that relational databases began to flourish. Therefore, when we say databases, most often we mean exactly relational databases.

Features of relational databases

The main feature of relational databases is, the fact that objects inside such databases are stored in the form a set of two-dimensional tables... That is, the table consists of a set of columns, which may contain: name, data type (date, number, string, text, etc.). Another important feature of relational databases is, that the number of columns is fixed, that is, database structure is known in advance, but the number of rows or rows in relational databases is not limited by anything, roughly speaking, strings in relational databases are objectsthat are stored in the database.

In fact, database Is an abstract concept, a table is just a way to store information, set of tables can be linked logically and this set is called a database. Therefore, it is wrong to say that MySQL is a database, a database is stored information. But such a concept as DBMS - database management system, That's what it is MySQL server, it is with the help of it that we manage the stored data. Or otherwise mathematical ideas.

The most difficult task when working with relational databases, is an database design... Designing the structure of a database is not only about creating a table and specifying the data type and column names. In fact, design is the most difficult stage when working with databases... Because the power of your computers is limited. As long as there is little data, few tables and rows in these tables, the machine will process them very, very quickly. But, over time, the amount of information will increase, and we will get a slowdown, which will increase, since the machine needs time to process certain requests (information processing). In the last article I already wrote that relational databases are primarily focused on modification (OLTP), that is, adding a new record to a table is a very simple operation for relational DBMS, but making a data selection is already a laborious operation. There is also a change in data, it is like an intermediate link between reading and adding. Although .

Database design

Well, we talked a little about advantages and disadvantages of relational databases... And now, in a nutshell, I will touch on the question database design... By design, I mean the following: a person sits at the table, takes paper and a pen, and based on the task at hand, as well as, based on the advantages and disadvantages of a particular system, in our case MySQL DBMS begins to compose the structure of the future database. The requirement for the designed database is usually set as follows:

The database should be as compact as possible, that is, non-redundant.
The database should be simple to process.

And as you probably understood, these requirements contradict each other. Design is the most important aspect when working with databases. Typically, the designer is an experienced database server administrator or a database architect with extensive experience. In serious projects, there may be several dozen, or even hundreds of tables, which are interconnected in the most intricate communication methods. Of course, I'm not going to delve into database design, and I can't do it, but some database design basics I'll try to highlight on my blog pages. Before you start designing a database, you need to understand what we are generally going to design. That is, we must understand what we should get at the output.

And at the exit we should get the so-called diagram or as it is also called schema... A chart is a definition: what information will be stored, in which table it will be stored, in which column of which data type, what is the name of the table, how many columns in the table and their type, how the tables are related. Yes, the data types in the columns can be different, for example, the phone number or the index can be written both using symbols and using a numeric data type. But the question appears: what type of data is best for storing phone number or zip code? Purely intuitively, this question is most often answered correctly - the phone number in the database must have a character type, but few can explain why a character type is. The explanation is very simple, for example, we needed all zip codes starting with 637 or phone numbers starting with 952, so making such a selection from data of a numeric type is a rather problematic task, and making the same selection from data of a character type is quite easy.

Relational database - basic concepts

Often, when talking about a database, they mean just some kind of automated data warehouse. This view is not entirely correct. Why this is so will be shown below.

Indeed, in the narrow sense of the word, a database is a certain set of data necessary for work (actual data). However, data is an abstraction; no one has ever seen "just data"; they do not arise and do not exist by themselves. Data is a reflection of objects in the real world. Suppose, for example, you want to store information about the parts received at the warehouse. How will a real world object - a part - be displayed in the database? In order to answer this question, you need to know which signs or sides of the part will be relevant, necessary for work. Among them may be the name of the part, its weight, dimensions, color, date of manufacture, material from which it is made, etc. In traditional terminology, objects of the real world, information about which are stored in the database, are called entities (even if this word does not scare the reader, this is a generally accepted term), and their actual signs are called attributes.

Each feature of a particular object is the value of the attribute. For example, the engine part has a weight attribute value of 50, which reflects the fact that this engine weighs 50 kilograms.

It would be a mistake to think that only physical objects are reflected in the database. She is able to absorb information about abstractions, processes, phenomena - that is, about everything that a person encounters in his activities. For example, a database can store information about orders for the supply of parts to a warehouse (although it is not a physical object, but a process). The attributes of the "order" entity will be the name of the supplied part, the number of parts, the name of the supplier, the delivery time, etc.

Objects of the real world are connected with each other by many complex dependencies that must be taken into account in information activities. For example, parts are supplied to the warehouse by their manufacturers. Therefore, the attribute "manufacturer name" must be included in the part attributes. However, this is not enough, since you may need additional information about the manufacturer of a particular part - his address, phone number, etc. This means that the database should contain not only information about parts and purchase orders, but also information about their manufacturers. Moreover, the database should reflect the relationship between parts and manufacturers (each part is produced by a specific manufacturer) and between orders and parts (each order is issued for a specific part). Note that only relevant, meaningful links need to be stored in the database.

Thus, in the broad sense of the word, a database is a collection of descriptions of objects in the real world and connections between them, relevant for a specific applied area. In what follows, we will proceed from this definition, making it more precise in the course of the presentation.

Relational data model

So, we got an idea of \u200b\u200bwhat is stored in the database. Now you need to understand how entities, attributes and relationships are mapped to data structures. This is determined by the data model.

Traditionally, all DBMSs are classified according to the data model that underlies them. It is customary to distinguish hierarchical, network and relational data models. Sometimes a data model based on inverted lists is added to them. Accordingly, one speaks of hierarchical, network, relational DBMS or DBMS based on inverted lists.

In terms of prevalence and popularity, relational DBMSs are out of competition today. They have become a de facto industrial standard, and therefore a domestic user will have to face in their practice with a relational DBMS. Let's take a quick look at the relational data model without going into its details.

It was developed by Codd in 1969-70 on the basis of the mathematical theory of relations and is based on a system of concepts, the most important of which are table, relation, row, column, primary key, foreign key.

A relational database is one in which all data is presented to the user in the form of rectangular tables of data values, and all operations on the database are reduced to manipulating the tables. The table consists of rows and columns and has a name that is unique within the database. The table reflects the type of object in the real world (entity), and each of its rows is a specific object. For example, the Part table contains information about all parts stored in the warehouse, and its rows are sets of attribute values \u200b\u200bfor specific parts. Each column in a table is a collection of values \u200b\u200bfor a particular attribute of an object. For example, the Material column is a set of values \u200b\u200bfor "Steel", "Tin", "Zinc", "Nickel", and so on. The Quantity column contains non-negative integers. The values \u200b\u200bin the Weight column are real numbers equal to the weight of the part in kilograms.

These values \u200b\u200bdo not appear out of thin air. They are selected from the set of all possible values \u200b\u200bfor an attribute of an object, which is called a domain. So, the values \u200b\u200bin the material column are selected from the set of names of all possible materials - plastics, wood, metals, etc. Therefore, in the Material column, it is fundamentally impossible for a value to appear that is not present in the corresponding domain, for example, "water" or "sand".

Each column has a name, which is usually written at the top of the table ( Figure: 1). It must be unique in the table, however, different tables can have columns with the same name. Any table must have at least one column; the columns are arranged in the table according to the order of their names when it was created. Unlike columns, rows do not have names; their order in the table is not defined, and the number is not logically limited.

Figure 1. Basic concepts of the database.

Since the rows in the table are not ordered, it is impossible to select a row by its position - among them there is no "first", "second", "last". Any table has one or more columns, the values \u200b\u200bin which uniquely identify each of its rows. Such a column (or combination of columns) is called the primary key. In the Part table, the primary key is the Part Number column. In our example, each part in the warehouse has a single number, by which the necessary information is extracted from the Part table. Therefore, in this table, the primary key is the Part Number column. Values \u200b\u200bin this column cannot be duplicated - in the Part table, there should not be rows that have the same value in the Part number column. If a table satisfies this requirement, it is called a relation.

The relationship of tables is an essential element of the relational data model. It is supported by foreign key. Consider an example in which the database stores information about ordinary employees (the Employee table) and managers (the Manager table) in some organization ( Figure: 2). The primary key of the Supervisor table is the Number column (for example, personnel number). The Last Name column cannot act as a primary key, since two managers with the same last names can work in the same organization. Any employee is subordinate to a single manager, which should be reflected in the database. The Employee table contains the Manager Number column, and the values \u200b\u200bin this column are selected from the Number column of the Manager table (see Figure: 2). The Supervisor Number column is a foreign key in the Employee table.

Figure 2. Relationship of database tables.

Tables cannot be stored and processed if the database does not contain "data about data", for example, descriptors for tables, columns, etc. They are usually referred to as metadata. Metadata is also presented in tabular form and is stored in a data dictionary.

In addition to tables, the database can store other objects such as displays, reports, views, and even applications that work with the database.

It is not enough for users of the information system that the database simply reflects objects of the real world. It is important that such a reflection is unambiguous and consistent. In this case, the database is said to satisfy the integrity condition.

In order to guarantee the correctness and mutual consistency of data, some constraints are imposed on the database, which are called data integrity constraints.

There are several types of integrity constraints. It is required, for example, that the values \u200b\u200bin a table column are selected only from the corresponding domain. In practice, more complex integrity constraints are also taken into account, for example, referential integrity. Its essence lies in the fact that a foreign key cannot be a pointer to a nonexistent row in a table. Integrity constraints are implemented using special tools, which will be discussed in Sec.Database server .

SQL language

By itself, data in computer form is not of interest to the user if there is no means of access to it. Data access is carried out in the form of database queries, which are formulated in a standard query language. Today, for most DBMS, this language is SQL.

The emergence and development of this language as a means of describing access to a database is associated with the creation of the theory of relational databases. SQL originated in 1970 as part of the System / R research project at the Santa Teresa lab at IBM. SQL is now the standard for interface with relational database management systems. Its popularity is so great that developers of non-relational DBMS (for example, Adabas) supply their systems with SQL-interface.

The SQL language has an official standard - ANSI / ISO. Most DBMS developers adhere to this standard, but often extend it to implement new data processing capabilities. New data management mechanisms to be described in Sec.Database server , can only be used through special SQL statements that are not generally included in the language standard.

SQL is not a traditional programming language. Not programs are written on it, but queries to the database. Therefore SQL is a declarative language. This means that it can be used to formulate what needs to be obtained, but it cannot be specified how it should be done. In particular, unlike procedural programming languages \u200b\u200b(C, Pascal, Ada), the SQL language lacks such operators as if-then-else, for, while, etc.

We will not go into detail about the syntax of the language. Let us touch on it only to the extent necessary to understand simple examples. With their help, the most interesting data processing mechanisms will be illustrated.

A SQL query consists of one or more statements, one after the other, separated by semicolons. Table 1 below lists the most important operators that are included in the ANSI / ISO SQL standard.

Table 1. Basic operators of the SQL language.

SQL queries use names that uniquely identify database objects. In particular, this is the name of the table (Detail), the name of the column (Name), as well as the names of other objects in the database that refer to additional types (for example, the names of procedures and rules), which will be discussed in Sec.Database server ... Along with simple names, complex names are also used - for example, the qualified column name defines the name of the column and the name of the table to which it belongs (Part.Weight). For simplicity, in the examples, names will be written in Russian, although in practice this is not recommended.

Each column in any table stores data of certain types. There are basic data types - strings of characters of fixed length, integers and real numbers, and additional data types - strings of characters of variable length, currency, date and time, logical data (two values \u200b\u200b- "TRUE" and "FALSE"). In SQL, you can use numeric, string, character, and date and time constants.

Let's look at some examples.

The query "determine the number of parts in stock for all types of parts" is implemented as follows:

SELECT Name, Quantity

FROM Part;

The query result will be a table with two columns - Name and Quantity, which are taken from the original Part table. Basically, this query allows you to get a vertical projection of the original table (more strictly, a vertical subset of the set of table rows). From all rows of the Part table, rows are formed that include values \u200b\u200btaken from two columns - Name and Quantity.

The query "What steel parts are in stock?", Formulated in SQL, looks like this:

FROM Part

WHERE Material \u003d "Steel";

The result of this query will also be a table containing only those rows of the source table that have the value "Steel" in the Material column. This query allows you to get a horizontal projection of the Part table (an asterisk in a SELECT statement means that you select all columns from the table).

The request "determine the name and quantity of parts in the warehouse that are made of plastic and weigh less than five kilograms" will be written as follows:

SELECT Name, Quantity

FROM Part

WHERE Material \u003d "Plastic"

AND Weight< 5;

The query result is a table of two columns - Name, Quantity, which contains the name and the number of parts made of plastic and weighing less than 5 kg. In fact, the selection operation is the operation of forming a horizontal projection first (find all rows of the Part table, for which Material \u003d "Plastic" and Weight< 5), а затем вертикальной проекции (извлечь Название и Количество из выбранных ранее строк).

Indexes are one of the tools that provide fast access to tables. An index is a database structure that is a pointer to a specific row in a table. The database index is used in the same way as a book index. It contains values \u200b\u200btaken from one or more columns of a particular table row and a link to that row. The values \u200b\u200bin the index are ordered, which allows the DBMS to perform fast searches in the table.

Let's assume that a query to the Warehouse database is formulated:

SELECT Name Quantity, Material

FROM Part

WHERE Number \u003d "T145-A8";

If there are no indexes for this table, then in order to execute this query, the DBMS must scan the entire Detail table, sequentially selecting rows from it and checking the selection condition for each of them. For large tables, such a query will take a very long time.

If the index was previously created on the column Table Number Part, then the search time in the table will be reduced to a minimum. The index will contain values \u200b\u200bfrom the Number column and a link to the row with this value in the Part table. When executing a query, the DBMS will first find the value "T145-A8" in the index (and will do it quickly, since the index is ordered and its rows are small), and then, by reference in the index, it will determine the physical location of the required row.

The index is created with the SQL CREATE INDEX statement. In this example, the operator

CREATE UNIQUE INDEX Part index

ON Part (Number);

will create an index named Part Index on the Part Table Number column.

For a user of a DBMS, it is not individual SQL statements that are of interest, but a certain sequence of them, formed as a whole and meaningful from his point of view. Each such sequence of SQL statements implements a specific action on the database. It is carried out in several steps, at each of which some operations are performed on the database tables. So, in the banking system, the transfer of a certain amount from a short-term account to a long-term account is performed in several operations. Among them - the withdrawal of an amount from a short-term account, crediting to a long-term account.

If during the execution of this action a failure occurs, for example, when the first operation is completed and the second is not, then the money will be lost. Therefore, any action on the database must be performed entirely, or not at all. This action is called a transaction.

Transaction processing relies on the log, which is used to roll back transactions and restore the state of the database. More details about transactions will be discussed in Sec.Transaction processing .

To conclude our discussion of SQL, let us emphasize once again that it is a query language. It is impossible to write any complex application program on it that works with a database. For this purpose, modern DBMSs use the fourth generation language (Forth Generation Language - 4GL), which has both the basic capabilities of the third generation procedural languages \u200b\u200b(3GL), such as C, Pascal, Ada, and the ability to embed SQL statements into the program text, as well as user interface controls (menus, forms, user input, etc.). Today 4GL is one of the de facto standards for database application development tools.