The Standard Facets
C++ Report, November/December 1997
Internationalization means building into a program the potential for worldwide use. Nowadays, it is a common task in almost every product development process. Internationalization is supported in various forms by programming languages, operating systems, and development tools. Traditionally, internationalization is done by means of the standard C library or equivalent C APIs such as the Win32 NLSAPI on Microsoft platforms or the X/Open NLS support on Unix platforms. Naturally, the C++ standards committee did not want to stand back and included internationalization support in the standard C++ library: a locale class was added, and its use was demonstrated by internationalizing the standard iostreams.
In our last contribution to this column (see /1/) we discussed the architecture of standard locales. Here is a brief recap: The standard C++ library provides an extensible framework for the support of internationalization. Its main elements are locales and facets. A locale is a class that represents a container of facets; a facet is a class that contains information and provides functionality related to a certain aspect of internationalization. A facet contained in a locale is accessed via a template function called use_facet<facet>(loc). The template argument facet is a facet class, and the function argument loc is a locale object; the function returns a constant reference to the object of class facet contained in the locale.
Last time we described the locale framework's architecture in detail and discussed the design of the locale and facet classes. The standard library not only provides the locale framework, but also contains a number of facet classes. In this article we explain which facets the standard already provides and what functionality they have. In a subsequent article we will demonstrate how one can use the locale framework to build and integrate a new user-defined, special-purpose facet.
Before we delve into the details of a certain standard facet, please have a look at the overview of the internationalization aspects the committee found important enough to be standardized. They are summarized in Table 1.
Now let us see how these facets help to cope with cultural differences. The following sections discuss problem areas related to differences in language and alphabet, and tasks concerning culture-dependent representations of numbers, monetary amounts, date, and time. We take a look at the problem domain first and then describe how the standard facets address these problems.
Language and Alphabet. Different ethnic groups use different languages. Hence language is one of the most apparent differences between cultures. Even within a single country people might prefer different languages. The Swiss, for example, use French, Italian, and German.
Languages also differ in the alphabet they use. Here are a couple of examples of languages and their respective alphabets:
We want to spare you the details about character encodings and codesets that can be used to represent different alphabets. However, we want to discuss, at least briefly, the different ways of representing alphabets with a large number of characters. There are two possible approaches for encoding large alphabets: character encodings that mix characters of different sizes (multibyte character encodings), or character encodings where all characters are of the same size (wide character encodings). It is common practice when handling large alphabets to use wide character encodings inside the program and multibyte character encodings outside on the external device.
Different languages have different rules for sorting characters and words. These rules are called the collating sequence. The collating sequence specifies the ordering of individual characters and other rules for ordering. In software development the order of characters is often determined by the numeric value of the byte(s) representing a character. This is what we call ASCII rules in the examples below. This kind of ordering does not meet the requirements of any language's dictionary sorting. Here is an example of ASCII collation compared to language dictionary sorting:
In an ASCII encoding the numerical values of uppercase letters are smaller than the values of lowercase letters. For this reason, all words that begin with a capital letter appear at the beginning of a list sorted according to ASCII rules.
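The effect is easy to reproduce. The following minimal sketch sorts a word list with the default ordering of std::string, which compares character codes; it assumes an ASCII-based execution character set. The helper name sort_by_code is ours, chosen for illustration only.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts words by the numeric values of their characters ("ASCII rules").
// Hypothetical helper for illustration; not part of the standard library.
std::vector<std::string> sort_by_code(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());  // uses operator< of std::string
    return words;
}
```

Given the words apple, Zebra, and banana, this ordering places Zebra first, which is not what a dictionary would do.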
In some languages certain groups of characters are clustered and treated like a single character for the purpose of sorting characters. In other languages it is the other way round; one character is treated as if it were actually two characters. Here is an example of one character treated as two:
The German character ß, called sharp s, is treated as if it were two characters, namely ss.
Character classification and conversion. Let's start our detailed examination of the standard facets with the ctype facet, defined by template <class charT> class ctype. Among other services, it provides the functionality to classify the characters of a character set. Criteria for this classification are provided as an enumerated bit set type, which is called mask. It is a nested type in ctype_base, the public base class of the ctype facet template. The values of mask and their semantics are listed in Table 2.
Member functions of the ctype facet provide the functionality
Another type of conversion that ctype supports is the conversion between ctype's template character type charT and the built-in character type char. This functionality is provided by the member functions narrow() and widen(). For each function two overloaded versions exist: one that converts single characters and one that converts character ranges.
For efficiency reasons the standard requires that ctype<char> be provided as a template specialization. Its implementation must be based on a table where the character encoding is the key and the value is a bit mask value of type ctype_base::mask. The bit mask values indicate all criteria to which the character conforms. For example, a lowercase letter such as 'k' is associated with the bit mask value ctype_base::alpha | ctype_base::lower | ctype_base::print.
This table-driven approach makes it possible to implement most of the member functions as simple and efficient bit operations.
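As a minimal sketch of how the ctype facet is used, the helpers below classify a character with is(), convert a character range with the range version of toupper(), and widen a char with ctype<wchar_t>::widen(). The helper names (is_upper, to_upper, widen_char) are our own and purely illustrative; the facet member functions are those of the standard library.

```cpp
#include <locale>
#include <string>

// Returns true if c is an uppercase letter according to the ctype facet of loc.
bool is_upper(char c, const std::locale& loc) {
    return std::use_facet<std::ctype<char> >(loc).is(std::ctype_base::upper, c);
}

// Converts a whole character range to uppercase in place.
std::string to_upper(std::string text, const std::locale& loc) {
    if (!text.empty()) {
        std::use_facet<std::ctype<char> >(loc).toupper(&text[0], &text[0] + text.size());
    }
    return text;
}

// Converts a char to the facet's character type, here wchar_t.
wchar_t widen_char(char c, const std::locale& loc) {
    return std::use_facet<std::ctype<wchar_t> >(loc).widen(c);
}
```

With the classic "C" locale, std::locale::classic(), these helpers behave like the corresponding C functions for the basic character set.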
String collation. The collate facet, defined by template <class charT> class collate, supports the comparison of strings according to language-specific rules. Its member function
int compare(const charT* low1, const charT* high1, const charT* low2, const charT* high2) const;
returns an integer value that indicates the order of the two character sequences [low1, high1) and [low2, high2): -1 indicates that the first sequence is less than the second, 1 indicates that the first sequence is greater than the second, and 0 indicates that both sequences are equal.
Note that the standard string operations are not internationalized. A string compare operation in class basic_string is a character-by-character comparison, which is not sufficient, for instance, when a single character must be interpreted as two characters, as some languages require. Consequently, internationalized programs cannot use the respective member functions of the basic_string class template for comparing strings; they need the functionality of the collate facet instead.
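A minimal sketch of a locale-aware string comparison might look as follows. The function name compare_strings is our own; it simply forwards two std::string objects to collate<char>::compare().

```cpp
#include <locale>
#include <string>

// Compares a and b according to the collation rules of loc.
// Returns a negative value, zero, or a positive value.
int compare_strings(const std::string& a, const std::string& b,
                    const std::locale& loc) {
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

For sorting, a locale object can also be passed directly as the comparison argument of std::sort, because std::locale provides an operator() that compares two strings by means of the collate facet.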
Code conversion. The codecvt facet, defined by template <class internT, class externT, class stateT> class codecvt, supports conversion between two character codesets. This is needed when the internal and the external character set of a program differ. The template parameters are:
The member function in() is used for the conversion from the external to the internal character set, out() for the conversion from the internal to the external. Both functions take an input character sequence and convert it to an output character sequence. For state-dependent conversions they also maintain the conversion state.
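The following sketch shows one way to call in() on the codecvt<wchar_t, char, mbstate_t> facet of a locale to convert an external (narrow) sequence into an internal (wide) one. The function name to_internal and the error handling are our own assumptions; the sketch also assumes that every wide character is produced from at least one external byte, so that a buffer of external.size() wide characters suffices.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Converts an external multibyte sequence to the internal wide representation.
// Hypothetical helper; throws if the conversion fails or is incomplete.
std::wstring to_internal(const std::string& external, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;
    const Converter& cvt = std::use_facet<Converter>(loc);

    if (external.empty()) return std::wstring();

    std::mbstate_t state = std::mbstate_t();        // initial conversion state
    std::wstring internal(external.size(), L'\0');  // assumed large enough
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result r = cvt.in(
        state,
        external.data(), external.data() + external.size(), from_next,
        &internal[0], &internal[0] + internal.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("code conversion failed or was incomplete");

    internal.resize(to_next - &internal[0]);        // keep only converted part
    return internal;
}
```

The opposite direction works the same way with out(), with the roles of the narrow and wide buffers exchanged.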
Message Catalogs. The messages facet, defined by template <class charT> class messages, supports the retrieval of user-defined localized messages from message catalogs. Its interface makes it possible to open and close a message catalog identified by a catalog name and to retrieve a message from an open catalog.
The upcoming C++ standard describes how message catalogs can be used via the messages facet's interface. The syntax of message catalogs, as well as the way message catalogs have to be installed and maintained, is beyond the scope of the standard and is implementation-specific.
Representation of Numbers, Monetary Amounts, Date and Time. Numbers are represented according to cultural conventions. For example, the symbol used for separation of the integer portion of a number from the fractional portion, the so-called radix character, can differ from country to country. In American English, this character is a period; in most European countries, it is a comma. Conversely, the symbol that groups numbers with more than three digits, the so-called thousands separator, is a comma in American English, and a period in much of Europe.
Even the grouping of digits varies. In American English, digits are grouped by threes. In Nepal, for instance, the first group has three digits, and all subsequent groups have two digits.
Similarly, units of currency are represented in different ways in different cultures. The currency symbol can vary, as can its placement and the format of negative currency values. For example, there are two different ways of representing the same amount in US dollars:
Here is an example that shows different cultural conventions for placing the currency symbol:
Obviously, the representation of time and date depends on cultural conventions. The names and abbreviations for days of the week and months of the year vary with the language. Also, some countries use a 24-hour clock; others use a 12-hour clock. Even calendars differ; they are based on historical, seasonal, and astronomical events. The official Japanese calendar, for instance, is based on a historical event, the beginning of the reign of the current Emperor. Many countries, especially in the Western World, use the Gregorian calendar instead.
Here are examples of representations of the same date in different countries. They differ in the order of day, month, and year, the separators between those items, and the use or omission of items such as the weekday in the long form of the date in Hungarian.
Numeric and Boolean values. The localization information and functionality related to numeric and Boolean expressions is handled by three standard facets:
Based on the information contained in the numpunct facet, the facets num_put and num_get provide the functionality to generate a formatted character sequence from a numeric or Boolean value, and the reverse functionality: parsing of a character sequence to extract a numeric or Boolean value. num_put does the formatting, num_get the parsing. The second parameter of the num_put and num_get templates, OutputIterator and InputIterator respectively, is used to specify the character sequence.
num_put provides overloaded versions of its member function put() for formatting of the following types: bool, long, unsigned long, double, long double, void*. num_get provides overloaded versions of get() for storage of the extracted value into the types: bool, long, unsigned long, unsigned int, unsigned short, float, double, long double, void*. At first it might look as though some versions were missing, for example for int or, in the case of put(), for float. But the intention was to keep the interface of the standard library concise: a value of type int can be handled by the version for long, and a float can be formatted by the version for double.
One of the parameters to put() and get() is a reference to an ios_base object. The format flags contained in this object are used to determine the format specifications for formatting and parsing. The semantics of the flags are the same as in standard iostreams. In fact, the formatting layer of iostreams uses the num_put and num_get facets for its formatting and parsing. An otherwise necessary type conversion is avoided because the iostreams operations pass the format specifications to the facets in the form of an ios_base object. The general use of num_put and num_get, however, is not limited by these design decisions; they may well be used in a context other than iostreams. The benefit of integrating facets into standard iostreams is that I/O operations for Boolean or numeric values are already internationalized.
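To illustrate the direct use of num_put outside of an inserter, here is a minimal sketch. The function name format_number is ours; an ostringstream is used merely as a convenient source of the ios_base object and of the output iterator.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Formats value according to loc by calling the num_put facet directly.
std::string format_number(double value, const std::locale& loc) {
    std::ostringstream out;  // supplies the ios_base with the format flags
    out.imbue(loc);          // num_put consults the locale of the ios_base
    typedef std::num_put<char, std::ostreambuf_iterator<char> > Formatter;
    const Formatter& fmt = std::use_facet<Formatter>(loc);
    fmt.put(std::ostreambuf_iterator<char>(out), out, out.fill(), value);
    return out.str();
}
```

This is essentially what the stream inserter for double does on the caller's behalf, which is why ordinary stream output is already locale-sensitive.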
Monetary values. The localization information for monetary values is organized in a similar fashion to the localization information for numeric values. There is one facet that holds the localization-dependent information. Based on this facet, two further facets provide the functionality for formatting and parsing character sequences that represent monetary values. The facets are:
Like numpunct, moneypunct's member functions provide information about the grouping of digits and about the characters used as radix separator and as thousands separator. Additionally, moneypunct can tell how many digits are represented after the radix separator, which string forms the currency symbol, and how negative and positive monetary amounts are structured.
money_put contains two overloaded versions of the put() member function. One formats a value of type long double as a representation of a monetary value; the other takes a reference to a basic_string<charT>.
money_get's overloaded member function get() does the reverse operation: it parses a character sequence that represents a monetary amount and stores the extracted value in either a long double or a basic_string<charT>.
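The two facets can be exercised with a small round trip, sketched below. The helper names format_money and parse_money are our own. Because the exact textual format depends on the moneypunct facet of the locale, the sketch deliberately shows no expected output; it only relies on get() being able to read back what put() produced for the same locale.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Formats a monetary amount (in units of the smallest currency fraction).
std::string format_money(long double units, const std::locale& loc) {
    std::ostringstream out;
    out.imbue(loc);
    const std::money_put<char>& mp = std::use_facet<std::money_put<char> >(loc);
    mp.put(std::ostreambuf_iterator<char>(out), false, out, out.fill(), units);
    return out.str();
}

// Parses a monetary amount; returns false if parsing failed.
bool parse_money(const std::string& text, const std::locale& loc,
                 long double& units) {
    std::istringstream in(text);
    in.imbue(loc);
    const std::money_get<char>& mg = std::use_facet<std::money_get<char> >(loc);
    std::ios_base::iostate err = std::ios_base::goodbit;
    mg.get(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>(),
           false, in, err, units);
    return (err & std::ios_base::failbit) == 0;
}
```

The bool argument selects between the international and the local currency format; false requests the local one.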
Date and time values. Two facets handle the localization functionality for date and time:
time_get provides several member functions that parse a character sequence and store the extracted date and time components in a struct tm. Examples of such member functions are get_monthname() and get_weekday(), which extract from the character sequence a value representing a month or a weekday, respectively, and store it in a struct tm.
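A minimal sketch of parsing a weekday name with time_get follows. The function name parse_weekday is ours. Which names are recognized depends on the locale; that the classic locale accepts English names such as Monday is an assumption that holds on common implementations.

```cpp
#include <ctime>
#include <ios>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Parses a weekday name and returns tm_wday (0 = Sunday), or -1 on failure.
int parse_weekday(const std::string& text, const std::locale& loc) {
    std::istringstream in(text);
    in.imbue(loc);
    const std::time_get<char>& tg = std::use_facet<std::time_get<char> >(loc);
    std::tm parts = std::tm();  // zero-initialized result structure
    std::ios_base::iostate err = std::ios_base::goodbit;
    tg.get_weekday(std::istreambuf_iterator<char>(in),
                   std::istreambuf_iterator<char>(), in, err, &parts);
    return (err & std::ios_base::failbit) ? -1 : parts.tm_wday;
}
```

get_monthname() is called in the same way and stores its result in the tm_mon member.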
byname Facets
In the sections above we structured the description of the facets according to the way they address a certain localization aspect. However, there is another way to structure them:
The statement locale myLocale("En_US"); creates a locale that represents the US localization environment, and we can be sure that in the code shown below rs will be initialized with '.':
char rs = use_facet<numpunct<char> >(myLocale).decimal_point();
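Locale names such as "En_US" are platform-specific, and the locale constructor throws std::runtime_error if the name is not supported. The sketch below wraps the facet access shown above and adds a fallback to the classic "C" locale; both helper names (radix_character, make_locale_or_classic) are our own and only illustrate one possible way of handling an unknown name.

```cpp
#include <locale>
#include <stdexcept>

// Returns the radix character of the given locale.
char radix_character(const std::locale& loc) {
    return std::use_facet<std::numpunct<char> >(loc).decimal_point();
}

// Tries to construct a named locale; falls back to the classic "C" locale
// if the platform does not know the name.
std::locale make_locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

For example, radix_character(make_locale_or_classic("En_US")) yields the radix character of the US environment where that name is available, and that of the classic locale otherwise.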
Base Class Facets
After this discussion of the behavior of the byname facets, which are derived facet types, let's have a look at the behavior of the base class facets. No further explanation is needed for num_put, num_get, money_put, and money_get. As described above, they define functionality rather than holding locale-sensitive information, and the base classes implement this functionality.
Some base class facets provide classic "C" behavior. Classic "C" means the way C functions used to behave before internationalization was added to the C standard. Facet base classes with classic "C" behavior are ctype, collate, and numpunct. Obviously, classic "C" describes a behavior only for the character type char. As we will see below, these three facets need to be provided for the character types char and wchar_t. The behavior for wchar_t is analogous to the classic "C" behavior for char. For instance, numpunct<wchar_t>::decimal_point() returns L'.' where numpunct<char>::decimal_point() returns '.'.
The base classes of the following facets have implementation-defined behavior: messages, moneypunct, time_get, and time_put. This is because the standards committee, as an international forum, did not want to dictate one nation's preference as a default for all other nations. For instance, there is no universally accepted pattern to represent a monetary amount. Therefore, they did not define a base class behavior.
In the case of code conversion, two codecvt base class facets must be provided by a standard-compliant library. The facet codecvt<char,char,mbstate_t> is a degenerate one; it implements "no conversion", so that in() and out() behave much like a memcpy(). The behavior of codecvt<wchar_t,char,mbstate_t>, which converts between wide and narrow characters, is implementation-defined. Usually, interfaces with implementation-defined behavior have to be avoided by users who strive for portability of their programs.
Hence, one might wonder whether it is a problem that the base class behavior is implementation-defined for some facets. The answer is: No, not really. In an internationalized application one will usually use the byname facets, because they provide localized information and functionality dependent on a specified cultural context. The behavior of a base class facet is of interest only when a new derived facet with a new behavior is to be implemented for an existing facet interface and the existing base class behavior is to be reused, if possible. The byname facet objects are powerful and already provide support for all common localization environments. So the derivation of a new facet type is necessary only when an exotic behavior is needed. In such a case it is very likely that the new functionality must be implemented from scratch and cannot be built by reusing the base class behavior. Hence the base class behavior is almost irrelevant, because most likely it will be overridden anyway.
Speaking of derivation and overriding functions: all standard facets follow the idiom that a non-virtual public member function calls a virtual protected member function, which implements the functionality. A derived class must then redefine the protected function, not the public one. The rationale behind this idiom is that a vendor might place code for system-specific functionality in the public member function. A user who derives from such a class need not know about or bother with the system-specific issues, but can simply provide the new functionality by overriding the protected member function. An example of system-specific functionality put into the public member function is the use of a mutex for multi-thread support.
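The idiom can be sketched with a facet derived from numpunct<char>. The class name comma_punct and the function format_with_commas are hypothetical names chosen for this illustration; the overridden functions do_decimal_point(), do_thousands_sep(), and do_grouping() are the protected virtual members behind the public decimal_point(), thousands_sep(), and grouping().

```cpp
#include <iomanip>
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// A derived facet overrides only the protected virtual do_ functions;
// the public non-virtual functions of numpunct remain untouched.
class comma_punct : public std::numpunct<char> {
protected:
    virtual char do_decimal_point() const { return ','; }
    virtual char do_thousands_sep() const { return '.'; }
    virtual std::string do_grouping() const { return "\3"; }  // groups of three
};

// Formats value with two decimals using the derived facet.
std::string format_with_commas(double value) {
    std::ostringstream out;
    // The locale takes ownership of the facet object created with new.
    out.imbue(std::locale(std::locale::classic(), new comma_punct));
    out << std::fixed << std::setprecision(2) << value;
    return out.str();
}
```

With this facet installed, the value 1234567.5 is written as 1.234.567,50; the stream's formatting code itself did not have to change.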
The standard requires that the facet classes and class templates shown in Table 3 must be provided by a standard-compliant implementation. It is up to the vendor how they are provided: as templates, or as (partial) specializations.
Summary
A standard-compliant C++ library not only provides a framework for internationalization support, consisting of locale and facet classes, but also provides a number of standard facet classes. This article gave an overview of the functionality of the standard facets along with an idea of the problem domain addressed by that functionality. A subsequent article will show how the locale framework can be extended by adding new, non-standard facet types.
References
© Copyright 1995-2007 by Angelika Langer. All Rights Reserved. URL: <http://www.AngelikaLanger.com/Articles/C++Report/StandardFacets/StandardFacets.html> last update: 10 Aug 2007