Character Types and Character Traits
C++ Report, April 1998
Character types have an impact on various classes in the standard library. Strings, iostreams, and facets are abstractions in the library that manipulate characters or character sequences. All of them are implemented as class templates that take the character type as a template argument. The motivation for this design is the desire to keep abstractions like strings and streams independent of the characters they handle. Consider string operations: concatenation of two strings, for instance, is mostly independent of the type of the characters that form the two strings. Hence it is possible to implement entirely generic strings that can handle sequences of any kind of character. That at least is the idea behind class templates like basic_string, basic_fstream, basic_streambuf, ctype, num_get, and num_put, to name but a few.

In practice it turns out that the character type alone does not provide enough information for all the tasks that strings and iostreams perform. Streams, for instance, need to know how to recognize the end of a file; after all, the end-of-file value might differ between character types. Strings, too, must know how characters are compared, because the character type alone does not imply it. This additional information about a character type is encapsulated in yet another abstraction: the character traits. For use in the standard library each character type must be accompanied by an associated traits type. Otherwise the character type cannot be used for instantiating the string and stream class templates, because they require a second template argument, which is the traits type associated with the character type.

Before we examine in detail how the character and traits types are used in strings, iostreams, and facets, we take a closer look at character types and character traits types in general.

Character Types

What is a character? We all know intuitively that it is something like an 'A' or a 'B'.
The notion that we share is the abstraction of a character. Such an abstract character has numerous properties: visual representations (glyphs), binary representations (codes), and many more.

Now, what do we mean when we talk of characters in the context of the standard library? Let's get it straight: as programmers, we deal with characters inside our C++ programs. In that context a character is an object, in the sense that it is an instance of a type, either a built-in type or a user-defined type like a class. The built-in character types in (C and) C++ are type char for narrow characters and type wchar_t for wide characters.

Like other objects in our program, character objects have not only a type, but also an individual object state. A character object's state is the content of the character, i.e. its binary representation. It is the bit pattern stored inside a storage unit of type char or wchar_t, for instance. The content of a character is also called a character code. A code usually belongs to a character encoding, which is a set of character codes along with rules for their interpretation.

In this article we are talking of character objects. Keep in mind that whenever we mention a "character" in the following text we mean a "character object inside a C++ program".

Character Type vs. Character Encoding

A character has two aspects that are relevant in a C++ program:
As you can see, there is no 1:1 relationship between the character type that describes the storage units used for storing a character, and the character encoding used to represent the code contained in that storage unit. Instead, a character sequence of a given encoding is stored in an array of units that have the minimum size required to hold any character of the encoding. The types typically used for storage of characters are the built-in character types char and wchar_t; one-byte character encodings are stored as char, and wide character encodings are stored as wchar_t. The table below shows examples of tiny and wide character encodings and the character type that is typically used for storing and processing them:
Requirements for Character Types

Let us return to the character type parameter of strings, iostreams, and facets. Potential candidates for the character type are, of course, the built-in types char and wchar_t. User-defined types are allowed, too. The Japanese delegation on the ISO committee standardizing C++ brought up the notion of Jchar, a character type that encapsulates information specific to the processing of Japanese character representations. Jchar would be a user-defined type that can be used for instantiation of class templates like basic_string, basic_fstream, ctype, etc.

Naturally, not just any type can serve as a character type. User-defined character types must meet the following requirements:
For certain purposes (details below) the character type must provide additional functionality:
Character Traits Types

A character traits type provides information associated with a character type. In order to be used for instantiation of the string or iostreams class templates, a traits type has to provide a set of member typedefs and functions, some of which are predominantly used by strings, others mostly by iostreams.

Requirements for Character Traits Types

Below we give an overview of the member typedefs and functions required of a character traits type, grouped by topic.

Copying, Finding, and Comparing Characters
The character traits are required to provide a number of member functions for typical operations on characters and character sequences. These are:
They are mostly used by the string classes in the standard library.

Handling the end-of-file Character
The character traits are required to provide types and functions for handling the end-of-file value of a character type. This information is used by the iostreams classes in the standard library. Here is an overview:
The end-of-file character is a special character that is different from all other character values. Historically, the end-of-file value was EOF, a constant of type int that is different from all character values of type char. In the standard iostreams this principle was generalized. The end-of-file character of a character type is provided by the character traits in the form of a static member function eof(). The type of the end-of-file value is defined as a type nested in the character traits, called int_type. Note that it usually is different from the character type itself, which is defined as char_type.

Two values of type char_type or int_type cannot simply be compared by means of the built-in equality operator, because char_type and int_type can be arbitrary types. Instead, values of type int_type are compared via the eq_int_type() member function and values of type char_type via eq(). The traits also have to provide functions for conversion between values of char_type and int_type (to_int_type() and to_char_type()) and a convenience function for certain stream operations (not_eof()) that returns an int_type value that is guaranteed to be different from the end-of-file value.

Conversion State and Stream Positions
A character traits type has to provide typedefs related to character code conversions and stream positioning. These are:

A discussion of character code conversions and stream positioning is beyond the scope of this article.
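To get a feel for the traits members described in this section, they can be called directly on the traits type that the library predefines for char. The following is only a minimal sketch; sequences_equal, is_eof, and example are hypothetical names introduced for illustration, and the block should compile as C++98 or later.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Shorthand for the predefined traits type of the built-in type char.
typedef std::char_traits<char> Traits;

// Comparing character sequences: true if the first n characters match.
bool sequences_equal(const char* a, const char* b, std::size_t n)
{
    return Traits::compare(a, b, n) == 0;
}

// End-of-file handling: true if the int_type value c is the end-of-file
// value. Note the use of eq_int_type() instead of the built-in operator==.
bool is_eof(Traits::int_type c)
{
    return Traits::eq_int_type(c, Traits::eof());
}

// Hypothetical usage example exercising both groups of members.
void example()
{
    char buf[6];
    Traits::copy(buf, "hello", 6);              // 5 characters plus '\0'
    assert(Traits::length(buf) == 5);
    assert(sequences_equal(buf, "hello", 5));
    assert(Traits::find(buf, 5, 'l') == buf + 2);
    assert(Traits::eq('a', 'a') && Traits::lt('a', 'b'));

    Traits::int_type i = Traits::to_int_type('x');
    assert(!is_eof(i));                         // a valid character
    assert(Traits::to_char_type(i) == 'x');     // round trip
    assert(is_eof(Traits::eof()));
    assert(!is_eof(Traits::not_eof(Traits::eof())));
}
```

For char the int_type is int, so the end-of-file value can be kept apart from every valid character value; a traits type for a user-defined character type would have to make the same guarantee with whatever int_type it chooses.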
The Predefined Character Traits

The standard library provides two predefined traits types for the built-in character types char and wchar_t. These two types are specializations of a class template called char_traits<class charT>. Here are the declarations of these predefined traits types as they appear in the header file <string>:

    template<> struct char_traits<char>;
    template<> struct char_traits<wchar_t>;

Interestingly, the char_traits<class charT> template itself is an empty class template. Its sole purpose is to serve as a primary template for specializations. It is not supposed to be instantiated for any character type. A traits type for a user-defined character type would be a specialization, not an instantiation, of the char_traits template.

The empty character traits class template is used as a default template argument for the class templates that require a traits type argument. Let's take a look at some typical examples:

    template <class charT, class traits = char_traits<charT>,
              class Allocator = allocator<charT> >
    class basic_string;

    template <class charT, class traits = char_traits<charT> >
    class basic_fstream;

Imagine you define a new character type myChar. Then you have to provide an associated traits type. If you define it as a specialization of the char_traits template, i.e. as char_traits<myChar>, then the default applies and a myChar string is of type basic_string<myChar>. Alternatively you could give the traits type a name of its own, say myCharTraits. In that case the traits type argument cannot be omitted, i.e. you have to say basic_string<myChar,myCharTraits> instead of just basic_string<myChar>. For this reason it is advisable to define the traits type associated with a character type as a specialization of the char_traits template. There is only one situation in which you would want to define traits types that are not specializations of the char_traits template: when you define more than one traits type for the same character type.
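The last case, a second traits type for the same character type, can be illustrated with the well-known case-insensitive string idiom. This is a sketch under stated assumptions, not something prescribed by the standard: ci_char_traits, ci_string, up, and example are hypothetical names, the traits type inherits everything it does not change from char_traits<char>, and case folding relies on std::toupper in the current C locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical second traits type for char that ignores case.
// Members that are not redefined here come from char_traits<char>.
struct ci_char_traits : public std::char_traits<char>
{
    // Upper-case code of c; the cast avoids negative arguments to toupper.
    static int up(char c)
    {
        return std::toupper(static_cast<unsigned char>(c));
    }
    static bool eq(char a, char b) { return up(a) == up(b); }
    static bool lt(char a, char b) { return up(a) < up(b); }

    static int compare(const char* a, const char* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char c)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
};

// The traits argument cannot be omitted: the default char_traits<char>
// would select the ordinary, case-sensitive behavior.
typedef std::basic_string<char, ci_char_traits> ci_string;

// Hypothetical usage example.
void example()
{
    ci_string a("Hello"), b("hELLO");
    assert(a == b);                 // compared via ci_char_traits::compare
    assert(a.find('L') == 2u);      // found via ci_char_traits::find
    assert(std::string("Hello") != std::string("hELLO"));
}
```

Because ci_string and std::string are different instantiations of basic_string, they are unrelated types; values of one cannot be assigned to or compared with values of the other without an explicit conversion.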
Usage of Character Types and Traits Types

We mentioned earlier that strings and iostreams rely on different parts of the character traits and that facets do not rely on the character traits at all, but impose additional requirements on the character type. To aid understanding of these differences, let us get a rough impression of the way the character traits are used in the implementation of strings, iostreams, and facets.

Strings

The implementation of the basic_string class template uses the traits member typedefs and functions whenever it manipulates characters and character sequences. For instance, traits::eq() is used for comparison in string functions such as find() and rfind(); traits::compare() is used for the implementation of the string compare() functions, and so on. Here is a simplified example that shows the use of traits::copy() and traits::assign() in one of the append() member functions of basic_string:

    template <class charT, class traits, class allocator>
    class basic_string {
    public:
        typedef typename allocator::size_type size_type;
    private:
        charT *_ptr;
        size_type _len;
    public:
        basic_string& append(const charT *s, size_type m)
        {
            .....
            size_type n;
            if (0 < m && _extend(n = _len + m)) {
                traits::copy(_ptr + _len, s, m);
                traits::assign(_ptr[_len = n], charT(0));
            }
            return (*this);
        }
    };

IOStreams

Iostreams classes rely on the character traits, too. They use the traits member typedefs and functions that have to do with the end-of-file value. Below is a typical example. Numerous member functions of stream and stream buffer classes return the end-of-file value in case of failure and a valid character in case of success. The get() functions for unformatted input of characters show the principle:

    template <class charT, class traits>
    class basic_istream : virtual public basic_ios<charT, traits> {
    public:
        typedef typename traits::int_type int_type;

        int_type get()
        {
            int_type c;
            ...
            if (!_Ok)
                c = traits::eof();
            else {
                ...
                c = rdbuf()->sbumpc();
                ...
            }
            return (c);
        }

        basic_istream<charT, traits>& get(charT& x)
        {
            int_type c = get();
            if (!traits::eq_int_type(traits::eof(), c))
                x = traits::to_char_type(c);
            return (*this);
        }
    };

Iostreams classes additionally use the traits members that relate to stream positioning and code conversion.

Note that the standard does not guarantee that iostreams classes restrict themselves to the traits member typedefs and functions related to the end-of-file value, stream positioning, and code conversion. They are also allowed to make use of member functions like compare(), assign(), etc. Similarly, strings could theoretically use eof(), not_eof(), pos_type, off_type, etc., although this is unlikely in practice. The principle is that strings and iostreams are permitted to rely on the full set of properties that are required of character types and their traits.

Facets

The facet class templates do not have a traits parameter. This is because many of the facets do not really manipulate characters. Think of the ctype facet, for instance: it classifies characters according to their properties, i.e. whether they are digits, white space, printable, lower-case or upper-case letters, etc. There is no need for ever really touching a character object, copying it, or comparing it.

This is different for the parsing and formatting facets like num_get, num_put, money_get, money_put, time_get, and time_put. They have to compare characters. Just think of the parsing of numeric values: the num_get facet has to recognize that an input character is the radix character or the thousands separator. It will therefore compare an input character to the respective symbols defined in the numpunct facet. Comparison is defined by the character traits in the form of the eq() function, but the facets do not know anything about the associated traits type. Instead of using traits::eq(), they compare two characters by means of operator==().
Here is a code snippet that could be part of the num_get facet's do_get() function:

    template <class charT, class InputIterator>
    class num_get : public locale::facet {
    protected:
        virtual iter_type do_get(iter_type beg, iter_type end, ios_base& iob,
                                 ios_base::iostate& err, long& v) const
        {
            ...
            char_type ct = *in;
            char c;
            if (ct == use_facet<numpunct<charT> >(iob.getloc()).decimal_point())
                c = '.';
            bool discard =
                (ct == use_facet<numpunct<charT> >(iob.getloc()).thousands_sep()
                 && use_facet<numpunct<charT> >(iob.getloc()).grouping().length() != 0);
            ...
        }
    };

An interesting side effect is that iostreams classes generally use the traits::eq() function for comparison of characters, but for parsing and formatting of numeric values they use operator==(). This is because parsing and formatting of numeric values is delegated to the numeric facets of the stream's locale, and we've seen above that facets do not use character traits. It follows that you should implement an operator==() for a user-defined character type that has the same semantics as the eq() function of the character type's traits.

One problem remains: you can have only one operator==() for a given character type, but several character traits types associated with that character type. What if the traits types have different eq() functions? Iostreams might yield "interesting" results under these circumstances. However, this problem is unlikely to occur in practice.

Consider also that iostreams classes rely not only on certain properties of the character type and the character traits type, but additionally require facets for that character type. Iostreams need the numeric facets, as we've mentioned above. They also need conversions between the character type and the built-in type char, which have to be defined in the ctype facet in the form of the member functions narrow() and widen(). They also need character classification functions from the ctype facet in order to identify white-space characters. A code conversion facet is needed for file streams.
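The facet services that iostreams depend on can be tried out directly against the classic "C" locale, which supplies them for char and wchar_t. The following is a minimal sketch and not part of any library implementation; is_decimal_point, is_space, and example are hypothetical helper names, and the block should compile as C++98 or later.

```cpp
#include <cassert>
#include <locale>

// True if c is the decimal point of the given locale. As in the facets
// themselves, the comparison uses operator==, not traits::eq().
template <class charT>
bool is_decimal_point(charT c, const std::locale& loc)
{
    return c == std::use_facet<std::numpunct<charT> >(loc).decimal_point();
}

// True if the locale's ctype facet classifies c as white space.
template <class charT>
bool is_space(charT c, const std::locale& loc)
{
    return std::use_facet<std::ctype<charT> >(loc).is(std::ctype_base::space, c);
}

// Hypothetical usage example with the classic "C" locale.
void example()
{
    const std::locale& loc = std::locale::classic();
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(loc);

    assert(ct.widen('7') == L'7');          // char -> wchar_t
    assert(ct.narrow(L'7', '?') == '7');    // wchar_t -> char, '?' is the fallback
    assert(is_space(L' ', loc) && !is_space(L'x', loc));
    assert(is_decimal_point('.', loc));
    assert(is_decimal_point(L'.', loc));
}
```

For a user-defined character type none of these use_facet calls would succeed until the corresponding facets had been written and installed in the stream's locale.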
It turns out that you have to provide all standard facets for a new character type, because facets are generally allowed to be interdependent.

Summary

Character types play a significant role in the standard library as template arguments to strings, iostreams, and facets. Character types have to meet certain requirements and must be accompanied by an associated character traits type and by the standard facets for that character type.

Acknowledgements

We thank Nathan Myers, Bill Plauger, and Jerry Schwarz for their willingness to answer our questions and help in interpreting the draft standard, and Kevlin Henney for his thorough review.
© Copyright 1995-2007 by Angelika Langer. All Rights Reserved. URL: <http://www.AngelikaLanger.com/Articles/C++Report/CharacterTraits/CharacterTraits.html> last update: 10 Aug 2007