Text file formats and the programs for working with them: history and the present day


What are text files for?

Today the three most common text formats are TXT, RTF and DOC. How do they differ, and what do they have in common? What they share is that they all store textual information. They differ in the formatting and processing capabilities they offer, and in how accessible the stored information is to other programs, that is, in compatibility.

The simplest text format

This is the oldest and most modest of the formats. All that can be done with text in this format is to type it in and preserve the paragraph breaks. In certain situations this simplicity becomes a virtue of universality and transparency: TXT can easily be read in different applications and on different platforms. In addition, many programs that are not themselves designed for working with text are able to save text in TXT format.

TXT-processors

Many remember the word processor Lexicon from the DOS era, which handled the TXT format at a fairly high level. Today the main tool for working with TXT is the standard Windows Notepad. Anyone who finds its features insufficient can always find an editor to suit their taste and needs on the World Wide Web, including free ones. For example, with Konstantin Sheremetyev's freeware program Vega you are unlikely to see a message saying that the opened text file is too large: according to the author, Vega version 2.04 opens files of up to 2 GB (!), while the program itself takes up only 9.5 KB (for comparison, Notepad in Windows XP "weighs" about 65 KB). Vega is also more convenient than Notepad and does not require installation. Here is another example of what can be done with "plain text": the text you are reading was typed in the UltraEdit editor from IDM Computer Solutions. Its strong point is the special display and handling of programming language syntax, but it can also work wonders with the most ordinary text. Those who appreciate handy Russian-language programs that are ergonomic and, most importantly, "knowledgeable" about the specifics of Cyrillic encodings should get acquainted with the Patriot program.

Formatting and universality

Rich Text Format is what the abbreviation in the name of this format, created by Microsoft Corporation, stands for. RTF is text marked up with special "control words", which makes it possible to apply and save fairly complex formatting and to insert footnotes, headers and footers, drawings, tables and formulas, although in handling these additional objects RTF is inferior to the DOC format. It is also inferior to DOC in file size: using "control words" for formatting instead of a style table does not make for compactness. However, RTF wins over DOC on security, because its internal organization does not provide for storing macro code, which makes it immune to macro viruses.
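
To make the idea of "control words" concrete, here is a minimal sketch that builds a tiny RTF document as an ordinary string. The function names `rtf_escape` and `make_rtf` are illustrative assumptions, not part of any library, and the sketch assumes ASCII-only input; a real document would normally also carry a font table and character-set information.

```python
def rtf_escape(text: str) -> str:
    """Escape the three characters that have special meaning in RTF."""
    # The backslash must be escaped first, otherwise the escapes
    # added for the braces would themselves be escaped again.
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def make_rtf(plain: str, bold: str) -> str:
    """Return a minimal RTF document: plain text followed by a bold run."""
    # \rtf1 opens the document, \ansi names the character set,
    # {\b ...} is a group in which the "bold" control word applies,
    # \par ends the paragraph.
    return ("{\\rtf1\\ansi " + rtf_escape(plain)
            + " {\\b " + rtf_escape(bold) + "}\\par}")


document = make_rtf("Ordinary text with {braces},", "then bold text")
print(document)
```

Because the result is plain text, it can be opened in Notepad, where the control words are visible alongside the content.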

RTF-processors

RTF is used as the primary or a supported format in many, if not most, word processing programs. A good tool is, for example, Mikhail Morozov's Hieroglyph. This program implements not only Russian spell checking but also automatic switching of the keyboard layout language. The Atlantis word processor from Rising Sun Solutions, which exists in both commercial and free versions, will suit many users thanks to its well-thought-out interface, large number of keyboard shortcuts, customizable toolbar and other features. The Patriot editor mentioned above can also work with RTF.

The "largest" text format

The DOC format offers the widest possibilities for processing and formatting text, including footnotes and comments, as well as creating, placing and editing tables, diagrams, images and other elements. True, all these features are implemented fully and most correctly only in MS Word, a consequence of Microsoft's policy of not disclosing the current specifications of this popular format. Although other programs also "understand" DOC, their makers do not always manage to ensure that it is recognized correctly. Unlike TXT and RTF, DOC is a binary format, which makes it unreadable in simple text editors; moreover, its own versions are not fully compatible with one another.

DOC-processors

The main and, for the reasons given above, "irreplaceable" word processor for working with DOC is MS Word, which implements all the features of this format most fully. Third-party developments have added a great deal to Word's productivity and functionality: all kinds of add-ons, macros and programs can be found in large numbers on the network. Word's competitors include, for example, Corel's WordPerfect, Sun Microsystems' StarOffice and the free OpenOffice.org. Whether you work in Word or in other programs, keep the problem of format compatibility in mind and save a document as DOC only if you are sure that no incompatibilities will arise.

Applicability of formats

It is groundless to claim that one of the formats considered is worse than the others without taking into account the specific tasks for which they are to be used. Since we are not setting ourselves the task of doing page layout in a word processor, the choice is almost unambiguous. To prepare a text of medium to very large volume and to make sure the typed text is fully "understood" by any layout program, it is most convenient to use the simplest, most compact and most versatile means of typing and storing text: the TXT format. As for the other text formats, a great deal depends on how well they are supported in a particular layout program.

OpenOffice.org is an international open source project aimed at creating a universal office suite that runs on different operating platforms, with an open API and an XML-based file format. Strictly speaking, OpenOffice.org is also the set of programs developed within this project. It includes a word processor, a spreadsheet, a graphics editor, a presentation system and a data access system. In capabilities it is comparable to similar commercial programs and may well be considered an alternative to them. At present OpenOffice.org is released under a dual license: GPL and SISSL. Despite the differences between these licenses, OpenOffice.org is free for the end user.

OpenOffice.org traces its origin to the office suite StarOffice, developed by the German company StarDivision in the mid-1990s. In the fall of 1999, Sun bought StarDivision. In June 2000, StarOffice 5.2 was released under the Sun trademark for MS Windows, Linux and Solaris. On October 13, 2000, the StarOffice source code was opened (excluding the code of some modules developed by third parties), and this day is officially considered the birthday of OpenOffice.org. Today volunteers from all over the world, as well as programmers from Sun, work on the OpenOffice.org code.

Currently, two products are built from the same source code developed by the OpenOffice.org community: StarOffice, which adds components under a proprietary license, and the free OpenOffice.org. In OpenOffice.org, most of the proprietary components present in StarOffice are replaced with free counterparts.

(According to cnews.ru.)

A set of rules for storing data in a file is called the file format. Different types of files, such as text files, raster graphics and so on, use different formats. In general, a given type of file may have several different formats, although file type and format are often treated as the same thing. The file format is indicated by the file name extension, which is appended to the file name when the file is saved in a certain format, for example DOC, GIF and so on.

As a rule, file formats are created for use in a strictly defined application program. For example, graphic objects created in the well-known CorelDRAW vector graphics package are saved as files with the CDR extension, and images generated by another graphics package, CorelXara, are written to disk as files with the XAR extension. Some formats are not tied to specific applications, that is, they are universal. One of the best-known universal formats is TXT (the DOS text file format).

Often, compression of computer files is used to save space on the medium. There are many ways to compress files. These methods depend on the original file format. Typically, the higher the compression ratio, the slower the read and write operations.

Compression algorithms are either lossless or lossy, that is, algorithms whose use may lead to the loss of some data.



Lossless compression ensures that all data that was in the file before compression is present after the file is unpacked. Lossless compression is used when saving text or numeric data, for example spreadsheets or document files. Well-known examples of lossless compression are the ZIP and ARJ archive formats, among others.
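
The defining property of lossless compression can be checked directly. The sketch below uses Python's standard `zlib` module (the DEFLATE method, which is also the usual method inside ZIP archives) rather than ZIP or ARJ themselves; the sample text is an arbitrary assumption chosen to be repetitive so that it compresses well.

```python
import zlib


def roundtrip(data: bytes) -> bytes:
    """Compress data with zlib at the highest level, then decompress it."""
    packed = zlib.compress(data, 9)
    return zlib.decompress(packed)


# Repetitive text compresses well; random data would barely shrink.
text = ("A set of rules for storing data in a file. " * 50).encode("utf-8")
packed = zlib.compress(text, 9)
restored = zlib.decompress(packed)

print("original bytes:  ", len(text))
print("compressed bytes:", len(packed))
print("identical after unpacking:", restored == text)
```

The last line is the point of the exercise: every byte of the original comes back after unpacking.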

Let's give a brief description of the main formats used:

§ American Standard Code for Information Interchange, ASCII (TXT). A text file format developed by the American National Standards Institute. It is supported by all operating systems and all programs. It is a text file in DOS encoding: pictures cannot be inserted, there is no formatting, it works on all machines, and the resulting files are small.

§ ANSI (TXT). A text file format in ANSI encoding (the Microsoft Windows code page).

§ MS Word for DOS and Windows (.DOC). A document format developed by Microsoft Corporation; it is supported by programs for MS-DOS and by most word processors. It preserves the original formatting of documents as well as character styles. In addition to text, files of this format can contain graphics with various parameters. It supports 256 colors and does not support compression. It is used mainly to exchange formatted text between different platforms and applications.

§ Hypertext Markup Language, HTML (HTM, HTML). A markup language for hypertext documents. Web pages on the Internet are created using this language. HTML documents are plain text files that can be viewed and edited in any text editor. They differ from an ordinary text file in that they contain special commands, called tags, which define how the document is to be formatted. Once you have mastered HTML, you can create pages for the Internet. By adding tags to ordinary text, you make the browser display that text in a certain way and position it on the page. If you have studied JavaScript, you know how to extend the capabilities of HTML by embedding commands written in a scripting language in the page.

§ Portable Document Format, PDF (.PDF). This document storage format, developed by Adobe, lays claim to the role of an open typographic standard for the Web. It is seen as an alternative to HTML. The disadvantage of HTML is that documents converted to HTML usually do not preserve their original layout, and HTML offers a very limited number of typefaces for viewing. By contrast, users of Acrobat and other PDF tools for creating, distributing and viewing documents in their original form know that readers will see the publication exactly as it was made. The PDF format is indispensable if you need an exact copy of a document. As an example of a successful use of PDF for documents in Russian, we can cite the "Moscow News" server on the Internet. The materials presented on it in electronic form reproduce the printed paper original exactly.

§ Standard Generalized Markup Language (SGML). The language from which HTML was derived. It is a toolkit of mechanisms for creating structured documents marked up with tags. Compared to HTML, it provides more flexible and versatile formatting capabilities on the Web. However, SGML is also considerably more complex, so PDF is used as a simpler tool. The power of SGML lies in its cross-platform, structural approach to describing the content of documents. SGML is in fact a metalanguage, that is, it is intended for describing the markup languages used when creating documents.
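
To illustrate the HTML entry in the list above, where tags are described as commands embedded in otherwise ordinary text, here is a small sketch that separates the two using Python's standard `html.parser` module. The class name `TextExtractor`, the helper `strip_tags` and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside of tags.
        self.chunks.append(data)


def strip_tags(markup: str) -> str:
    """Return the text content of an HTML fragment without its tags."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


page = "<html><body><h1>Title</h1><p>Plain <b>bold</b> text.</p></body></html>"
print(strip_tags(page))
```

What remains after the tags are removed is exactly what a TXT file would have contained; the tags only tell the browser how to present it.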

The most frequently used type of data in the computer world and on the Internet is text. Video and graphics are much more colorful, and in general it is better to see once than to hear a hundred times. Hearing is not bad either, and for that there are audio formats. Nevertheless, the computer world is ruled by plain and modest letters and digits. Nothing works without them: you cannot even give a name to a file of any other type. Text data is important and diverse: books, documents, program code. And for each purpose there are dedicated formats. These are the subject of this article. One caveat should be made right away: this review does not cover e-book formats, which deserve a separate discussion. Here we deal with document formats.

Text format - TXT (PlainText)

So, the simplest possible format is TXT. It is text in its pure, uncomplicated form. It contains only the text itself and an absolute minimum of service data: characters marking the beginning and end of the text, carriage returns and the like.

Despite its almost Spartan simplicity, the format is not free of variants and differences. First, Windows, Unix and Mac OS use different end-of-line characters. Differences can also arise from the use of single-byte encodings (ASCII and its 8-bit code page extensions) or of Unicode encodings.
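
Both kinds of difference can be seen at the byte level. The following is a minimal sketch; the function names and the sample data are assumptions for illustration. It counts the three conventional line endings (CRLF on Windows, LF on Unix, CR on classic Mac OS), normalizes them, and shows how the same five-letter Russian word occupies a different number of bytes in different encodings.

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CRLF), Unix (LF) and classic Mac OS (CR) line endings."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF bytes that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # CR bytes that are not part of a CRLF pair
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def normalize_to_lf(data: bytes) -> bytes:
    """Convert every line ending to the Unix convention."""
    # CRLF must be replaced first so that its CR is not converted separately.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


sample = b"one\r\ntwo\nthree\rfour"
print(count_line_endings(sample))
print(normalize_to_lf(sample))

# The same word stored in a single-byte code page and in two Unicode encodings.
word = "текст"
for encoding in ("cp1251", "utf-8", "utf-16-le"):
    print(encoding, len(word.encode(encoding)), "bytes")
```

A file that looks like garbage in one editor is often perfectly valid text that was simply decoded with the wrong encoding.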

Nevertheless, the TXT format is extremely versatile, which is why programmers and system administrators are so fond of it.

Formats of MS Office documents and analogues - DOC, DOCX, RTF, ODT

For all its versatility and simplicity, TXT is completely unsuitable for creating documents proper, that is, texts intended for printing in compliance with particular rules of presentation and design. Besides the text itself, such documents must contain a great deal of information about its design and formatting, as well as about the format and size of the paper sheet on which it is to be placed.

For these purposes quite a number of formats have been created for various office suites. The most popular, and practically universal, are the MS Word formats DOC and DOCX. The first is a closed format created by Microsoft for its word processor (more precisely, a whole line of formats, since it was revised several times over its lifetime). Alongside it, early in the company's history, Microsoft created the RTF (Rich Text Format) format. Unlike DOC, the structure of this format is published, and it is supported by almost all existing text editors, although it is somewhat inferior to DOC in the set of available features.

The closed nature of Microsoft's formats led to the creation of the open office suite OpenOffice, for which the ODT (OpenDocument Text) format was developed. The format is not well supported by commercial editors, including MS Word, and they may open it with errors.

Finally, in 2007 Microsoft moved away from the DOC format and developed the Office Open XML family of formats, which includes DOCX and which became the default in new versions of MS Word.

PDF format

Adobe, for its part, went its own way. It developed the PDF format, intended not so much for authoring documents as for viewing and printing them. Unlike the previous group, which is formatted text whose appearance can still vary depending on the machine it is displayed or printed on, PDF is a document format that is essentially fixed and keeps its appearance and layout under any conditions. It also supports a fairly wide range of both typographic elements and additional services (for example, password protection of a document against editing or printing). All this makes PDF more of a format for distributing complex, professionally produced documents and even books.

Every PC user constantly encounters different text file formats, but hardly ever thinks about how rich the history is of these formats and of the programs that gave people the ability to read books, work with text and create all the necessary documentation directly on the computer.

The history of text files is not much shorter than that of personal computers themselves: the first masterpieces were already being written in early analogues of the modern Notepad. So what text file formats and programs for working with them exist? First you need to understand what text files are for, how they differ and what they have in common. All text formats are united by their main task, storing textual information. They differ in their processing capabilities and in how accessible the stored information is in terms of compatibility with other programs.

The simplest text format is traditionally TXT. It is the oldest text format and the most modest in features. Thanks to its simplicity (the capabilities of TXT are limited to typing text and breaking it into paragraphs), this format is used by a huge number of applications on a variety of platforms.

As personal computers spread and their sales grew, Microsoft created another popular format, called Rich Text Format (or simply RTF). It is text marked up with certain "control words", which make it possible not only to apply but also to preserve complex formatting and to insert formulas, tables, figures, footnotes and headers and footers into the text.

However, RTF is noticeably inferior in capabilities to the DOC format, also created by Microsoft, specifically for the Microsoft Office software package. Created more than fifteen years ago, DOC offers a huge number of possibilities for formatting and processing text and for creating, editing and placing images, diagrams, tables and other elements. It should be noted that these functions work most correctly only in MS Word. This is primarily because Microsoft does not disclose the current specifications of the DOC format and so does not allow its competitors and independent developers to use the capabilities of this format to the full. This is one of the main reasons why, in addition to DOC, other text file formats are widely used today.

The main difference between the DOC format and RTF and TXT is that it is binary, which makes it unreadable in simple editors such as Wordpad, Lexicon and Atlantis. Moreover, in some cases DOC files created in different versions of MS Word turn out to be incompatible.

Text files can be opened and edited in a huge number of programs. In addition to the previously mentioned MS Word, the most common of them are StarOffice from Sun Microsystems, WordPerfect from Corel and the free OpenOffice.org package.

With the proliferation of electronic readers, other types of text files are gaining popularity, for example, FB2 and LRF.

So that different text formats can be used on different platforms, a large number of programs called converters have been created. Text file converters let you save the source text from one format in another and then use it on various devices and platforms.

Converters are used not only to save text from one format in another, but also to create files that, unlike their source, can be used on devices that cannot "read" the original files. For example, some e-book readers that do not support popular text file formats can easily recognize the LRF or FB2 formats obtained from source files with the help of converter programs.
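
At its simplest, a converter reads one representation of the text and writes another. The sketch below converts plain text into HTML; the function name `txt_to_html` and the rule that a blank line separates paragraphs are assumptions made for the example. Real converters for formats such as FB2 or LRF are considerably more involved.

```python
import html


def txt_to_html(text: str) -> str:
    """Convert plain text to a minimal HTML page, one <p> per paragraph."""
    # Assumption: paragraphs in the source are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # html.escape protects characters such as < and & that HTML treats specially.
    body = "\n".join("<p>%s</p>" % html.escape(p) for p in paragraphs)
    return "<html><body>\n%s\n</body></html>" % body


source = "First paragraph, where a < b.\n\nSecond paragraph."
print(txt_to_html(source))
```

Note that the conversion has to escape characters that are harmless in TXT but meaningful in the target format, which is typical of format conversion in general.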

Once, text data was stored in only one kind of container: TXT. There were no others. Now their number is perhaps approaching fifty. Some we use constantly, others we rarely encounter, and still others we do not even suspect exist. Let us consider the most common text data containers in terms of ease of use.

TXT ("simple text")

The ancestor of the "genre", actively used to this day. Since the text is stored as a sequence of characters, the file size in bytes (in a single-byte encoding) equals the number of characters, including non-printing ones (spaces, tabs, end-of-paragraph marks and others, also called formatting marks). This keeps the file small. However, the possibilities for formatting such documents are very limited: in effect, it is just text. Text data can be stored in containers other than those with the TXT extension. In fact, the extension is not mandatory. Rename TXT to DOC and nothing will change: the internal structure remains the same. Likewise, by changing the DOC extension to TXT you still have the same Word file. Why, then, are these three letters after the dot needed? So that files are interpreted correctly by the programs that open them by default.
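
Since renaming a file does not change its contents, the real format can often be recognized from the first bytes of the file rather than from its extension. The following is a minimal sketch: the `sniff` function and the selection of signatures are assumptions for illustration, and the list is far from complete. Plain text has no signature at all, so it can only be guessed at by exclusion.

```python
# Well-known byte sequences found at the very start of some file types.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (legacy DOC and other Office files)"),
    (b"PK\x03\x04", "ZIP container (DOCX, ODT and others)"),
    (b"{\\rtf", "RTF"),
    (b"%PDF-", "PDF"),
]


def sniff(head: bytes) -> str:
    """Guess the format from the first bytes of a file, ignoring its name."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown (possibly plain text)"


# The extension plays no part in the decision, only the content does.
print(sniff(b"{\\rtf1\\ansi Hello}"))
print(sniff(b"%PDF-1.4 ..."))
print(sniff(b"Just some ordinary text"))
```

To apply this to a real file, one would read its first few bytes in binary mode and pass them to `sniff`.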

RTF (Rich Text Format)

A free cross-platform format for storing marked-up text documents, created by Microsoft in 1987. Today it is widespread, and most modern text editors support it. An RTF file created on the Windows platform can be read and edited perfectly well on other platforms (Apple, Linux and others). It is a de facto standard in printing. However, not all programs create it equally well. It has been observed that in documents created in OpenOffice the formatting was sometimes lost and part of the text turned into unreadable symbols.

RTF makes it possible to apply and save fairly complex formatting and to insert footnotes, headers and footers, drawings, tables and formulas, although in this it is still inferior to the DOC format. It is also inferior to DOC in file size: complex documents are stored more compactly in DOC files (simple ones, the other way round). However, RTF wins over DOC on security, since it does not use macros. For this reason Word files infected with macro viruses can be "cured" by saving them in RTF format. In addition, the RTF format is resistant to file corruption. Changing even a single byte in a DOC file may prevent it from opening in Word, whereas damage to an RTF file usually leads only to the loss of the corrupted piece of text.

DOC (from the English "document")

Initially, this extension was used for simple text files without formatting, but in the early 1990s Microsoft effectively "privatized" it, so DOC is now associated only with that company's products. The format provides great possibilities for formatting text (including embedded scripts and macros). Because of this, compatibility with third-party text editors has deteriorated. A file of this format contains a huge amount of information about fonts, type styles, paragraph indents and spacing, even if you do not need any of it. It is because of this additional information that a DOC file containing only text is larger than the corresponding RTF file. However, once various graphic elements and images are included in the document, DOC wins on size and provides greater compatibility. Unlike TXT and RTF, DOC is a binary format, which makes it unreadable in simple text editors. Notepad, for example, can display some RTF files. DOC is as popular as RTF.

DOCX

With the advent of Office 2007, Microsoft moved to new formats based on Office Open XML (visually distinguished by the letter "x" added to the end of the extension). The format is a ZIP archive containing the text in the form of XML, along with graphics and other data. ZIP compression is used to reduce the file size. The documents are backward compatible with Office 2000 / XP / 2003 only if the Microsoft Office Compatibility Pack is installed (you can find and download it from the official Microsoft website; the file size is 27.8 MB). If you need to quickly convert DOCX to another format, use the services of the site http://docx-converter.com/. If you use the latest version of Office and plan to pass files on to someone, save the documents in RTF or DOC.
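
The "ZIP archive with XML inside" structure can be explored with nothing more than Python's standard library. In the sketch below, `docx_text` reads the main document part, `word/document.xml`, and joins the contents of its text elements. The helper `fake_docx` is purely an assumption for demonstration: it builds a stripped-down stand-in archive, not a valid DOCX that Word would open, so that the example runs without any external file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by the main document part of a DOCX file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(blob: bytes) -> str:
    """Extract the raw text from the bytes of a DOCX-style archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    # Text lives in <w:t> elements; paragraph boundaries are ignored here.
    return "".join(node.text or "" for node in root.iter("{%s}t" % W_NS))


def fake_docx(text: str) -> bytes:
    """Build a minimal stand-in archive (hypothetical, NOT a valid DOCX)."""
    body = ('<w:document xmlns:w="%s"><w:body><w:p><w:r><w:t>%s</w:t></w:r></w:p>'
            '</w:body></w:document>' % (W_NS, text))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


blob = fake_docx("Hello from inside the archive")
print(blob[:4])          # every ZIP archive starts with these four bytes
print(docx_text(blob))
```

For a real document, one would read the file's bytes in binary mode and pass them to `docx_text`; renaming a .docx file to .zip and opening it with any archiver shows the same structure.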

ODT / ODF ("Open Document Format")

ODF is the common name for an open document format for office applications (text, spreadsheets, drawings, databases, presentations). Text data is stored in files with the ODT extension. The standard was developed by the industry consortium OASIS and is based on XML. On May 1, 2006 it was adopted as the international standard ISO / IEC 26300. ODF is available to everyone and can be used without restrictions, a free alternative to Microsoft's closed formats. To read and write the ODF format in Microsoft products, the Sun ODF Plugin for Microsoft Office was released. Support for ODF in Microsoft Office 2007 should be introduced with the release of Service Pack 2. Unfortunately, ODF is still less widespread than RTF and DOC.

HTML

(from the English Hypertext Markup Language - "hypertext markup language")

The standard markup language for documents on the Internet (extension .htm / .html). Web pages are created using HTML (or XHTML). HTML was developed by the British scientist Tim Berners-Lee in 1991 as a language for exchanging scientific and technical documentation, suitable for use by people who are not specialists in page layout. Text with HTML markup was meant to be reproduced on various devices without stylistic or structural distortion. Later, however, the active introduction of multimedia and graphic design put an end to these plans. No special editors are needed to view HTML documents; the standard tools built into the OS are enough. In openness, indexability, convertibility and readability it surpasses any other format. Unfortunately, graphics are saved in a separate folder. Internet Explorer lets you save text and graphics in a single MHT document, but other browsers may not open such a file.

CHM (Compiled HTML)

CHM is essentially a set of compiled HTML documents, something like an archive of web pages, which is why it is smaller in size. To view it, the utility built into Windows 98 / NT and later is used. There are also third-party viewers. To create .chm files you can use the free tool HTML Help Workshop. The format is now actively used for the help files of various applications.

PDF

(Portable Document Format)

A cross-platform format for electronic documents, created by Adobe Systems using a number of features of the PostScript language. It is intended primarily for presenting printed materials in electronic form. To view it you can use the official free program Adobe Reader, as well as programs from other developers. Its convenience is that it solves the problems of shifting formatting, incorrect display of embedded graphic elements and missing fonts. On any platform the file is displayed exactly as it was created. The traditional way to create PDF documents is as follows: the document is prepared in its own program and then exported to PDF. Some programs, for example OpenOffice.org, can export directly (without using a virtual printer). In MS Word there is no such option yet. PDF is the de facto standard for most documentation.

DjVu ("deja vu")

A lossy image compression technology designed specifically for storing scanned documents (books, magazines, manuscripts and so on), where the presence of formulas, diagrams, drawings and handwritten characters makes full text recognition extremely laborious. It is also an effective solution when all the nuances of the design need to be conveyed, for example in historical documents. It is very common, and many libraries use it to store scanned scientific books. DjVu is sometimes called a "text-graphic" format. The essence of DjVu technology is the automatic splitting of an image into several regions (for example, text, a company logo and a raster photo), for each of which the optimal compression algorithm is selected. In addition, a DjVu file can contain a built-in interactive table of contents and active areas (links), which makes convenient navigation possible. Compared with the GIF format, it reduces the file size by a factor of roughly 15 to 20 on average.

XML-formats

("Extensible Markup Language")

There are quite a few text formats created for one particular device or program, for example for e-books. These include Rocket e-book (.rb), Microsoft Reader (.lit), PalmDoc, MobiPocket (.prc) and others. As a rule, they are all created using the XML language. The most successful and most common of them is the FictionBook format (FB2). At the moment it is the most advanced and promising format for e-books. Its only drawback is the long time needed to prepare the source text, which is repaid by the convenience of reading. In FictionBook the emphasis is on structuring the document: using tags, you can mark different parts of the text (chapters, headings, citations, frames). How everything looks on the screen depends on the reader program. If you want the document to be rendered in a certain way, you can attach a style sheet.
