{"id":371,"date":"2011-05-30T19:22:01","date_gmt":"2011-05-30T17:22:01","guid":{"rendered":"http:\/\/www.retibus.com\/en\/blog\/?p=371"},"modified":"2016-04-01T20:10:59","modified_gmt":"2016-04-01T18:10:59","slug":"a-code-point-iterator-adapter-for-c-strings-in-utf-8","status":"publish","type":"post","link":"http:\/\/www.nubaria.com\/en\/blog\/?p=371","title":{"rendered":"A code point iterator adapter for C++ strings in UTF-8"},"content":{"rendered":"<p>As the last post in this series I&#8217;ve been writing on Unicode and UTF-8, I thought I would elaborate on an interesting idea I mentioned in my previous post. When discussing how a <code>std::string<\/code> object that stores UTF-8 text is just a sequence of raw bytes rather than Unicode code points, I hinted that it wouldn&#8217;t be difficult to write a special iterator class for those situations where we may need to traverse the code point values rather than the bytes. In this post I explain how to write such an iterator class.<\/p>\n<p><!--more--><\/p>\n<p>Note that most of the time, say when we want to iterate through a string to find a particular character or substring, the ordinary <code>std::string::iterator<\/code> and <code>const_iterator<\/code> objects are fine. In previous posts I already commented on the interesting properties of UTF-8 that make such operations safe. But there are still times when we may need to check the particular numeric values of the code points that make up a Unicode string. For example, the XML specification restricts the code points that are allowed for many tokens, so a good XML parser would need to check the validity of code points. Or we may want to identify text in a particular language. As an example of this, let&#8217;s think up an exercise: how could we check if a UTF-8-encoded string contains any Armenian text? 
The Unicode character charts show that the range of code points assigned to the Armenian script goes from 0x530 to 0x58F, so we will need a way to extract the numeric values from the UTF-8 text as we traverse it.<\/p>\n<p>What we need for this is an adapter-style iterator class, let&#8217;s call it <code>Utf8Iterator<\/code>, that we can initialize with a <code>std::string::const_iterator<\/code> so that, on the one hand, each iteration would make the internal iterator advance as many bytes as make up a full character and, on the other hand, dereferencing the iterator would yield a Unicode code point as a 32-bit value. If we manage to write such a class, checking for the appearance of Armenian text would be as easy as pie:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nbool TextHasArmenianCharacters(const std::string&amp; text)\r\n{\r\n\tbool found = false;\r\n\t\r\n\tfor(Utf8Iterator codePointIterator = text.begin(); !found &amp;&amp; codePointIterator != text.end(); ++codePointIterator)\r\n\t\tfound = *codePointIterator &gt;= 0x530 &amp;&amp; *codePointIterator &lt;= 0x58f;\r\n\r\n\treturn found;\r\n}\r\n<\/pre>\n<h2>1. The Adapter pattern<\/h2>\n<p>The class we want should manage the core string iterator object by making it move forward or backward by up to four steps for every code point (a Unicode code point is made up of four bytes at most in the UTF-8 encoding) and provide a dereferencing mechanism that returns a 32-bit value rather than the plain <code>char<\/code>s that <code>std::string<\/code> uses. This is an example of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Adapter_pattern\">the Adapter design pattern<\/a>. 
The new class should aggregate the plain string iterator and rely on it for traversing the string, but adapting the semantics and the state of each iteration to the knowledge of Unicode code points, a concept that dumb <code>std::strings<\/code> are completely ignorant of.<\/p>\n<p>This adapter class will only be used for reading values, not for overwriting the <code>std::string<\/code> object, so it should work like a <code>const_iterator<\/code>. We can start writing this adapter as a class that stores a <code>std::string::const_iterator<\/code> assigned on construction. In the header file we will have:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nclass Utf8Iterator\r\n{\r\npublic:\r\n\r\n\tUtf8Iterator(std::string::const_iterator it);\r\n\r\n\tUtf8Iterator(const Utf8Iterator&amp; source);\r\n\r\n\tUtf8Iterator&amp; operator=(const Utf8Iterator&amp; rhs);\r\n\r\n\t~Utf8Iterator();\r\n\r\nprivate:\r\n\r\n\tstd::string::const_iterator mStringIterator;\r\n};\r\n<\/pre>\n<p>The above code is a basic skeleton of a class that uses another class by aggregation and supports the ordinary copy semantics. I have decided not to make the constructor <code>explicit<\/code> so that <code>std::string::const_iterator<\/code>s can be cast directly to the <code>Utf8Iterator<\/code> type. If the safety of explicit initialisation is preferred, the keyword <code>explicit<\/code> should be added in front of the constructor. We could also add an empty constructor if we wanted to be able to declare these objects before initialising them (I have added such a constructor to my own implementation), but that&#8217;s usually not necessary. 
In the .cpp file we will have:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\r\nUtf8Iterator::Utf8Iterator(std::string::const_iterator it) :\r\nmStringIterator(it)\r\n{\r\n}\r\n\r\nUtf8Iterator::Utf8Iterator(const Utf8Iterator&amp; source) :\r\nmStringIterator(source.mStringIterator)\r\n{\r\n}\r\n\r\nUtf8Iterator&amp; Utf8Iterator::operator=(const Utf8Iterator&amp; rhs)\r\n{\r\n\tmStringIterator = rhs.mStringIterator;\r\n\treturn *this;\r\n}\r\n\r\nUtf8Iterator::~Utf8Iterator()\r\n{\r\n}\r\n\r\n<\/pre>\n<p>Apart from the possibility of adding an empty constructor, it may be necessary to add a constructor that takes a <code>std::string::iterator<\/code> parameter to get round some type conversion issues. That additional constructor, not shown here, should look exactly the same as the one that uses <code>const_iterator<\/code>.<\/p>\n<h2>2. A bidirectional iterator<\/h2>\n<p>Now we want this class to behave as an iterator, but what kind of iterator exactly? The C++ standard library differentiates between several kinds of iterators, depending on which operations are allowed. For example, if <code>it<\/code> is a <code>std::vector::iterator<\/code> we can use <code>it + 3<\/code> to advance the iterator by three steps, or <code>it - 2<\/code> to move backwards in the container by two steps. Provided that we don&#8217;t move beyond the valid range (and it is up to our code, not to the compiler, to check that) these expressions can be dereferenced to access the values at those positions. This kind of very versatile iterator is a random-access iterator and it is also the kind of iterator used by <code>std::string<\/code>. But if <code>it<\/code> is a <code>std::list::iterator<\/code> instead, then the compiler will complain with an error when it finds an expression like <code>it + 3<\/code>. 
This is because the <code>std::list<\/code> uses a more limited kind of iterator called &#8216;bidirectional iterator&#8217;, which only allows for one increment or decrement at a time through expressions like <code>++it<\/code> or <code>--it<\/code>. The standard C++ library provides random-access iterators when multiple increments can be carried out in constant time, but not when it is a linear-time operation. Such a design actively discourages iteration idioms that would be inefficient. That&#8217;s why the <code>std::list::iterator<\/code>s aren&#8217;t random-access, even if it would be straightforward to implement <code>operator+(int)<\/code> by calling <code>operator++<\/code> repeatedly.<\/p>\n<p>Since the <code>std::string<\/code> uses random-access iterators, we may be tempted to think that <code>Utf8Iterator<\/code> should also be a random-access iterator. But if we think about this carefully, we&#8217;ll find that it is not possible to implement a codepoint-based iterator in such a way that random access is possible at constant time. A <code>std::string<\/code> is just a sequence of bytes with a fixed size, so moving N steps ahead is just a matter of adding N to a pointer, but we can&#8217;t advance N code points in a UTF-8 sequence by such simple arithmetic. Because each code point may span between one and four bytes, we need to scan all the bytes in succession and count the steps until we reach N increments. This is a linear-time, not constant-time, operation, just like traversing a linked list. Because of this limitation on random access, it makes sense to treat a codepoint-based iterator as a bidirectional one. Also, I can&#8217;t think of any real situation where we would want to move the current iterating position in a UTF-8 sequence by a certain number of code points. 
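<\/p>\n<p>To make the linear cost concrete, here is a minimal sketch of what advancing by N code points would involve. The <code>AdvanceCodePoints<\/code> helper is hypothetical (it is not part of the <code>Utf8Iterator<\/code> class developed in this post) and it assumes well-formed UTF-8:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n#include &lt;cstddef&gt;\r\n#include &lt;string&gt;\r\n\r\n\/\/ Hypothetical helper, assuming well-formed UTF-8: returns the position n code points after it.\r\n\/\/ Every byte on the way has to be inspected, so the cost grows linearly with n.\r\nstd::string::const_iterator AdvanceCodePoints(std::string::const_iterator it, std::string::const_iterator end, std::size_t n)\r\n{\r\n\twhile(n &gt; 0 &amp;&amp; it != end)\r\n\t{\r\n\t\t++it; \/\/ Step over the first byte of the current code point.\r\n\t\twhile(it != end &amp;&amp; (static_cast&lt;unsigned char&gt;(*it) &amp; 0xC0) == 0x80)\r\n\t\t\t++it; \/\/ Skip continuation bytes, which have the bit pattern 10xxxxxx.\r\n\t\t--n;\r\n\t}\r\n\r\n\treturn it;\r\n}\r\n<\/pre>\n<p>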
So, both in terms of efficiency and usability it makes sense to implement this sort of class as a bidirectional iterator.<\/p>\n<p>Being a bidirectional iterator, we can ignore those operators that add or subtract integer values and only need to implement the pre and post versions of the <code>++<\/code> and <code>--<\/code> operators:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\r\nUtf8Iterator&amp; Utf8Iterator::operator++()\r\n{\r\n\t\/\/TO DO: Advance mStringIterator by the size in bytes of the current code point.\r\n\treturn *this;\r\n}\r\n\r\nUtf8Iterator Utf8Iterator::operator++(int)\r\n{\r\n\tUtf8Iterator temp = *this;\r\n\t++(*this);\r\n\treturn temp;\r\n}\r\n\r\nUtf8Iterator&amp; Utf8Iterator::operator--()\r\n{\r\n\t\/\/TO DO: Move mStringIterator back by the size in bytes of the preceding code point.\r\n\treturn *this;\r\n}\r\n\r\nUtf8Iterator Utf8Iterator::operator--(int)\r\n{\r\n\tUtf8Iterator temp = *this;\r\n\t--(*this);\r\n\treturn temp;\r\n}\r\n\r\n<\/pre>\n<p>The above code contains the trivial part of the implementation and two &#8216;TO DO&#8217; comments for the part that requires knowledge of the UTF-8 syntax rules. We will come back to that later.<\/p>\n<h2>3. The current code point value<\/h2>\n<p>An iterator class has a state, which is the current position on the sequence it manages. In the general Iterator pattern as described in the classic 1995 Design Patterns book (see <a href=\"#references\">the references below<\/a>), iterator classes have a <code>CurrentItem<\/code> method to access the item where the iteration process happens to be. In the C++ standard library, the syntax for iterators is based on pointers (and the most straightforward implementation for random-access iterators are indeed pointers), so the current item in the standard C++ iterators is accessed through the dereferencing <code>operator*<\/code>, which is also the syntactic approach I&#8217;ve followed in the initial Armenian problem. 
So, by implementing the dereferencing operator we can provide a natural way to access the current Unicode code point value. Which type should we use for this value? Before the new C++0x standard, any unsigned integer at least 32 bits in size would do. In the soon-to-be-official C++0x standard, already partially implemented by the main compilers, there is the char32_t type, which is ideal both in terms of size and semantics for this. This is the type I will use for the code point values. If you&#8217;re using a compiler that doesn&#8217;t support this new standard type, you can substitute any 32-bit unsigned integer type.<\/p>\n<p>So, we can now add <code>operator*<\/code> to our code. Since this class is only for reading, we only want to implement the <code>const<\/code> version. In the .cpp file:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\r\nchar32_t Utf8Iterator::operator*() const\r\n{\r\n\tchar32_t currentCodePoint = 0;\r\n\r\n\t\/\/TO DO: Calculate the code point at the position pointed to by mStringIterator. \r\n\r\n\treturn currentCodePoint;\r\n}\r\n\r\n<\/pre>\n<p>As we did before, we&#8217;ll concentrate on the structure of the class first, and leave the nitty-gritty for later.<\/p>\n<h2>4. Adding comparison operators<\/h2>\n<p>When we use an iterator class we usually have to compare an iterator that is traversing a container with other iterators that point to particular positions in the container, typically the beginning and the end. Such operations require that the comparison operators for equality and difference, <code>==<\/code> and <code>!=<\/code>, be defined. We can add these operators to the class in a straightforward way by comparing the internal iterators. 
In the .cpp file we will have:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\r\nbool Utf8Iterator::operator==(const Utf8Iterator&amp; rhs) const\r\n{\r\n\treturn mStringIterator == rhs.mStringIterator;\r\n}\r\n\r\nbool Utf8Iterator::operator!=(const Utf8Iterator&amp; rhs) const\r\n{\r\n\treturn mStringIterator != rhs.mStringIterator;\r\n}\r\n\r\n<\/pre>\n<p>Since we will often want to compare these code point iterators against the iterators returned by the <code>std::string::begin()<\/code> and <code>std::string::end()<\/code> methods, it makes sense to add overloaded versions of these operators in order to avoid relying on too many unnecessary type conversions and temporary objects, like this:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\r\nbool Utf8Iterator::operator==(std::string::iterator rhs) const\r\n{\r\n\treturn mStringIterator == rhs;\r\n}\r\n\r\nbool Utf8Iterator::operator==(std::string::const_iterator rhs) const\r\n{\r\n\treturn mStringIterator == rhs;\r\n}\r\n\r\nbool Utf8Iterator::operator!=(std::string::iterator rhs) const\r\n{\r\n\treturn mStringIterator != rhs;\r\n}\r\n\r\nbool Utf8Iterator::operator!=(std::string::const_iterator rhs) const\r\n{\r\n\treturn mStringIterator != rhs;\r\n}\r\n<\/pre>\n<h2>5. Deriving from <em>std::iterator<\/em><\/h2>\n<p>The standard C++ library has a <code>std::iterator<\/code> template that defines some handy <code>typedef<\/code>s such as <code>value_type<\/code>. Actually, algorithms that work on iterators should not use these <code>typedef<\/code>s directly, since pointers are also valid iterators (something like <code>std::vector&lt;int&gt;::iterator::value_type v;<\/code> can&#8217;t possibly compile if <code>std::vector&lt;int&gt;::iterator<\/code> happens to be an <code>int*<\/code>). 
The <code>std::iterator<\/code> template is however useful because there is another template, <code>std::iterator_traits<\/code>, which provides the right way for algorithms to access type information from iterators (<code>std::iterator_traits&lt;std::vector&lt;int&gt;::iterator&gt;::value_type v;<\/code> should always compile), and the primary <code>std::iterator_traits<\/code> template picks up the member <code>typedef<\/code>s that <code>std::iterator<\/code> defines if no <code>iterator_traits<\/code> specialisation is provided. So, in order to save us writing an <code>iterator_traits<\/code> specialisation, it makes sense to derive the iterator class from <code>std::iterator<\/code>.<\/p>\n<p>Deriving from <code>std::iterator<\/code> has the additional advantage of making the purpose of the class clearer to casual readers of the source code, who will identify the <code>Utf8Iterator<\/code> class as a bidirectional iterator type straight away. So, in the header file we can modify the declaration of the class to add inheritance from <code>std::iterator<\/code>:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nclass Utf8Iterator : public std::iterator&lt;std::bidirectional_iterator_tag, char32_t, std::string::difference_type, const char32_t*, const char32_t&amp;&gt;\r\n{\r\n\t&#x5B;...]\r\n};\r\n<\/pre>\n<h2>6. Applying the UTF-8 rules<\/h2>\n<p>We now have all the scaffolding in place, but we still have to write the non-trivial part of the implementation. This affects three methods: <code>operator*<\/code> to read the current code point, <code>operator++<\/code> to advance the internal iterator by one full code point, and <code>operator--<\/code> to move backward by one code point. The encoding rules of UTF-8 are explained in the standard specification for Unicode. Another very good reference is the English Wikipedia article (see <a href=\"#references\">the references below<\/a>).<\/p>\n<p>In order to implement <code>operator++<\/code> we need to check the pattern of the byte the internal iterator points to. 
If the leftmost bit is not set (i.e. it has a value below 128) then it must be a one-byte-long ASCII character. Otherwise, the initial byte must have a bit pattern that begins with &#8216;110&#8217; if it&#8217;s a two-byte code, &#8216;1110&#8217; if it&#8217;s a three-byte code, or &#8216;11110&#8217; if it&#8217;s a four-byte code. We can implement this in the .cpp file as follows:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nconst unsigned char kFirstBitMask = 128; \/\/ 10000000\r\nconst unsigned char kSecondBitMask = 64; \/\/ 01000000\r\nconst unsigned char kThirdBitMask = 32; \/\/ 00100000\r\nconst unsigned char kFourthBitMask = 16; \/\/ 00010000\r\nconst unsigned char kFifthBitMask = 8; \/\/ 00001000\r\n\r\n&#x5B;...]\r\n\r\nUtf8Iterator&amp; Utf8Iterator::operator++()\r\n{\r\n\tchar firstByte = *mStringIterator;\r\n\r\n\tstd::string::difference_type offset = 1;\r\n\r\n\tif(firstByte &amp; kFirstBitMask) \/\/ This means the first byte has a value greater than 127, and so is beyond the ASCII range.\r\n\t{\r\n\t\tif(firstByte &amp; kThirdBitMask) \/\/ This means that the first byte has a value of at least 224 (111xxxxx), and so it must be at least a three-octet code point.\r\n\t\t{\r\n\t\t\tif(firstByte &amp; kFourthBitMask) \/\/ This means that the first byte has a value of at least 240 (1111xxxx), and so it must be a four-octet code point.\r\n\t\t\t\toffset = 4;\r\n\t\t\telse\r\n\t\t\t\toffset = 3;\r\n\t\t}\r\n\t\telse\r\n\t\t{\r\n\t\t\toffset = 2;\r\n\t\t}\r\n\t}\r\n\r\n\tmStringIterator += offset;\r\n\r\n\treturn *this;\r\n}\r\n<\/pre>\n<p>To keep things simple, I have omitted checks for invalid UTF-8 syntax in the code. In production code, we would want to throw exceptions if an invalid UTF-8 sequence is found.<\/p>\n<p>Now let&#8217;s see how to write <code>operator--<\/code>. We will have to decrement the internal iterator and first check whether it is an ASCII value (an unset leftmost bit). 
If that&#8217;s the case, then the decremented internal iterator is already pointing to the previous code point. Otherwise we&#8217;ll have to decrement the internal iterator again up to three times until we find a bit pattern with the two leftmost bits set. Again, I omit any error-checking for ill-formed UTF-8:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nUtf8Iterator&amp; Utf8Iterator::operator--()\r\n{\r\n\t--mStringIterator;\r\n\r\n\tif(*mStringIterator &amp; kFirstBitMask) \/\/ This means that the previous byte is not an ASCII character.\r\n\t{\r\n\t\t--mStringIterator;\r\n\t\tif((*mStringIterator &amp; kSecondBitMask) == 0)\r\n\t\t{\r\n\t\t\t--mStringIterator;\r\n\t\t\tif((*mStringIterator &amp; kSecondBitMask) == 0)\r\n\t\t\t{\r\n\t\t\t\t--mStringIterator;\r\n\t\t\t}\r\n\t\t}\r\n\t}\r\n\t\r\n\treturn *this;\r\n}\r\n<\/pre>\n<p>And now there&#8217;s only one method left, the dereferencing operator, which will have to make use of the UTF-8 rules to compose the code point value out of the bytes that make up the character. 
This is done as follows (excuse the magic numbers; feel free to take them out as constants if you want to use this code):<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nchar32_t Utf8Iterator::operator*() const\r\n{\r\n\tchar32_t codePoint = 0;\r\n\r\n\tchar firstByte = *mStringIterator;\r\n\r\n\tif(firstByte &amp; kFirstBitMask) \/\/ This means the first byte has a value greater than 127, and so is beyond the ASCII range.\r\n\t{\r\n\t\tif(firstByte &amp; kThirdBitMask) \/\/ This means that the first byte has a value of at least 224 (111xxxxx), and so it must be at least a three-octet code point.\r\n\t\t{\r\n\t\t\tif(firstByte &amp; kFourthBitMask) \/\/ This means that the first byte has a value of at least 240 (1111xxxx), and so it must be a four-octet code point.\r\n\t\t\t{\r\n\t\t\t\tcodePoint = (firstByte &amp; 0x07) &lt;&lt; 18;\r\n\t\t\t\tchar secondByte = *(mStringIterator + 1);\r\n\t\t\t\tcodePoint += (secondByte &amp; 0x3f) &lt;&lt; 12;\r\n\t\t\t\tchar thirdByte = *(mStringIterator + 2);\r\n\t\t\t\tcodePoint += (thirdByte &amp; 0x3f) &lt;&lt; 6;\r\n\t\t\t\tchar fourthByte = *(mStringIterator + 3);\r\n\t\t\t\tcodePoint += (fourthByte &amp; 0x3f);\r\n\t\t\t}\r\n\t\t\telse\r\n\t\t\t{\r\n\t\t\t\tcodePoint = (firstByte &amp; 0x0f) &lt;&lt; 12;\r\n\t\t\t\tchar secondByte = *(mStringIterator + 1);\r\n\t\t\t\tcodePoint += (secondByte &amp; 0x3f) &lt;&lt; 6;\r\n\t\t\t\tchar thirdByte = *(mStringIterator + 2);\r\n\t\t\t\tcodePoint += (thirdByte &amp; 0x3f);\r\n\t\t\t}\r\n\t\t}\r\n\t\telse\r\n\t\t{\r\n\t\t\tcodePoint = (firstByte &amp; 0x1f) &lt;&lt; 6;\r\n\t\t\tchar secondByte = *(mStringIterator + 1);\r\n\t\t\tcodePoint += (secondByte &amp; 0x3f);\r\n\t\t}\r\n\t}\r\n\telse\r\n\t{\r\n\t\tcodePoint = firstByte;\r\n\t}\r\n\r\n\treturn codePoint;\r\n}\r\n<\/pre>\n<p>And that&#8217;s it. 
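<\/p>\n<p>As a quick sanity check of the masks and shifts, the two-byte branch can be tried in isolation on a known character. The sketch below uses a hypothetical stand-alone function, <code>DecodeTwoByteSequence<\/code>, which is not part of the class and assumes that it is given a well-formed two-byte sequence:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\/\/ Hypothetical stand-alone version of the two-byte branch of operator*, with the same masks and shifts.\r\n\/\/ It assumes a well-formed sequence: a 110xxxxx lead byte followed by a 10xxxxxx continuation byte.\r\nchar32_t DecodeTwoByteSequence(unsigned char firstByte, unsigned char secondByte)\r\n{\r\n\tchar32_t codePoint = (firstByte &amp; 0x1f) &lt;&lt; 6; \/\/ The five payload bits of the lead byte.\r\n\tcodePoint += (secondByte &amp; 0x3f); \/\/ The six payload bits of the continuation byte.\r\n\r\n\treturn codePoint;\r\n}\r\n<\/pre>\n<p>U+0561 (Armenian small letter ayb) is encoded in UTF-8 as the bytes 0xD5 0xA1, and <code>DecodeTwoByteSequence(0xD5, 0xA1)<\/code> gives (0x15 &lt;&lt; 6) + 0x21 = 0x561, which falls inside the Armenian range.<\/p>\n<p>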
We can now test the <code>TextHasArmenianCharacters<\/code> function in the exercise we suggested at the beginning, and it should return true only if the supplied text contains Armenian characters.<\/p>\n<h2>7. An optimisation: caching the current value<\/h2>\n<p>The implementation we have for the <code>Utf8Iterator<\/code> class should work correctly, but it has a performance drawback that we can avoid. Basically, if we have declared a <code>Utf8Iterator it<\/code>, then every time we dereference it (<code>*it<\/code>), the value will be recalculated. Since it is common to dereference an iterator more than once within the same iteration step (the <code>TextHasArmenianCharacters<\/code> example does it twice), it is better in terms of efficiency to cache the code point value, so that it need not be recalculated in multiple dereferences. This can be done by adding a new member to the class: <code>char32_t mCurrentCodePoint<\/code>. The code that we&#8217;ve written for <code>operator*<\/code> can then be moved to a utility private method <code>CalculateCurrentCodePoint<\/code>. We might be tempted to call this method from the increment and decrement operators, but there is a problem with that, which is that an iterator is not always dereferenceable. In particular, <code>mStringIterator<\/code> may be pointing to the end position of a string. That&#8217;s why typical iterating code usually compares iterator values with the one returned by the <code>std::string::end<\/code> method before attempting to dereference them. Because of this behaviour of iterators, we can only dereference the internal <code>mStringIterator<\/code> when the code point iterator itself is being dereferenced.<\/p>\n<p>But then we need a way to know whether the current code point value needs recalculation or not. We can do this through a &#8216;dirty&#8217; flag that must be set to true whenever the code point value needs to be reevaluated. 
We will simply need to set it to <code>true<\/code> after the initial assignment and when the iterator is incremented or decremented. <code>CalculateCurrentCodePoint<\/code> will return straight away if the &#8216;dirty&#8217; flag is false and recalculate the code point and reset the flag to <code>false<\/code> if it is <code>true<\/code>. Both the cached value and the &#8216;dirty&#8217; flag need to be declared as <code>mutable<\/code> in C++, and the utility method itself must be <code>const<\/code>, in order to keep the dereferencing operator a <code>const<\/code> method.<\/p>\n<p>So, in the header file we will have to extend the class definition with two additional member variables and the private utility method:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\t&#x5B;...]\r\n\tmutable char32_t mCurrentCodePoint;\r\n\tmutable bool mDirty;\r\n\r\n\tvoid CalculateCurrentCodePoint() const;\r\n<\/pre>\n<p>In the implementation file, we will have to initialise the new members in the constructors and in the assignment operator:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nUtf8Iterator::Utf8Iterator(std::string::const_iterator it) :\r\nmStringIterator(it),\r\nmCurrentCodePoint(0),\r\nmDirty(true)\r\n{\r\n}\r\n\r\nUtf8Iterator::Utf8Iterator(const Utf8Iterator&amp; source) :\r\nmStringIterator(source.mStringIterator),\r\nmCurrentCodePoint(source.mCurrentCodePoint),\r\nmDirty(source.mDirty)\r\n{\r\n}\r\n\r\nUtf8Iterator&amp; Utf8Iterator::operator=(const Utf8Iterator&amp; rhs)\r\n{\r\n\tmStringIterator = rhs.mStringIterator;\r\n\tmCurrentCodePoint = rhs.mCurrentCodePoint;\r\n\tmDirty = rhs.mDirty;\r\n\r\n\treturn *this;\r\n}\r\n<\/pre>\n<p>The increment and decrement operators will have to be modified so that they set the <code>mDirty<\/code> flag to true:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nUtf8Iterator&amp; Utf8Iterator::operator++()\r\n{\r\n\t&#x5B;...]\r\n\r\n\tmDirty = true;\r\n\r\n\treturn *this;\r\n}\r\n\r\nUtf8Iterator&amp; 
Utf8Iterator::operator--()\r\n{\r\n\t&#x5B;...]\r\n\r\n\tmDirty = true;\r\n\r\n\treturn *this;\r\n}\r\n<\/pre>\n<p>The dereference operator can now be implemented in terms of the private method <code>CalculateCurrentCodePoint<\/code>:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nUtf8Iterator::reference Utf8Iterator::operator*() const\r\n{\r\n\tCalculateCurrentCodePoint();\r\n\r\n\treturn mCurrentCodePoint;\r\n}\r\n<\/pre>\n<p>Finally, the code that has been taken out of <code>operator*<\/code> must be moved to the implementation of <code>CalculateCurrentCodePoint<\/code>, where the result is stored in <code>mCurrentCodePoint<\/code>:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\nvoid Utf8Iterator::CalculateCurrentCodePoint() const\r\n{\r\n\tif(mDirty)\r\n\t{\r\n\t\t&#x5B;...]\r\n\r\n\t\tmDirty = false;\r\n\t}\r\n}\r\n<\/pre>\n<h2>8. Adding error checks<\/h2>\n<p>The code we have written assumes that the iterator will only have to deal with well-formed UTF-8. In production code, it is advisable to check whether the bit patterns adhere to the UTF-8 rules, and throw exceptions whenever any ill-formed UTF-8 is found.<\/p>\n<h2>9. Writing a similar class for UTF-16<\/h2>\n<p>Following the same line of reasoning, it is possible to write a <code>Utf16Iterator<\/code> class. In fact, <code>Utf16Iterator<\/code> is even easier to write since the UTF-16 algorithm is simpler. We would have to choose the character and string types that would replace <code>char<\/code> and <code>std::string<\/code> for UTF-16 characters. In the newer C++0x standard it is natural to use <code>char16_t<\/code> and <code>std::u16string<\/code>, but when programming for Windows with Visual Studio, and until there is better support for the C++0x types, we would have to use <code>wchar_t<\/code> and <code>std::wstring<\/code>.<\/p>\n<h2><a name=\"references\">10. References<\/a><\/h2>\n<ol>\n<li><em><a href=\"http:\/\/www.amazon.co.uk\/Design-patterns-elements-reusable-object-oriented\/dp\/0201633612\">Design Patterns. 
Elements of Reusable Object-Oriented Software<\/a><\/em>. Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides. Addison-Wesley 1995.<\/li>\n<li><em><a href=\"http:\/\/www.unicode.org\/versions\/Unicode6.0.0\/ch03.pdf\">The Unicode Standard. Version 6.0 \u2013 Core Specification<\/a><\/em>. Standard specification. Refer to pages 93 and 94 for the rules of UTF-16 and UTF-8.<\/li>\n<li><em><a href=\"http:\/\/en.wikipedia.org\/wiki\/UTF-8\">UTF-8<\/a><\/em>. Article on English Wikipedia.<\/li>\n<li><em><a href=\"http:\/\/en.wikipedia.org\/wiki\/UTF-16\">UTF-16<\/a><\/em>. Article on English Wikipedia.<\/li>\n<li><em><a href=\"http:\/\/msdn.microsoft.com\/en-us\/magazine\/cc301955.aspx\">C++ and STL: Take Advantage of STL Algorithms by Implementing a Custom Iterator<\/a><\/em>. An article on custom iterators by Samir Bajaj on MSDN magazine.<\/li>\n<li><em><a href=\"http:\/\/groups.google.com\/group\/comp.lang.c++\/browse_frm\/thread\/596ea8a7e27a842f\">iterator_traits<\/a><\/em>. A discussion about <code>std::iterator_traits<\/code> on the comp.lang.c++ newsgroup.<\/li>\n<li><em><a href=\"http:\/\/groups.google.com\/group\/comp.lang.c++\/browse_frm\/thread\/cec6353b96991de7\">Help on iterator_traits?<\/a><\/em>. 
Another discussion about <code>std::iterator_traits<\/code> on the comp.lang.c++ newsgroup.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>As the last post in this series I&#8217;ve been writing on Unicode and UTF-8, I thought I would elaborate on an interesting idea I mentioned in my previous post. 
When discussing how a std::string object that stores UTF-8 text is &hellip; <a href=\"http:\/\/www.nubaria.com\/en\/blog\/?p=371\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14,11,12],"tags":[],"class_list":["post-371","post","type-post","status-publish","format-standard","hentry","category-cc","category-character-encoding","category-unicode"],"_links":{"self":[{"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=\/wp\/v2\/posts\/371","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=371"}],"version-history":[{"count":58,"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=\/wp\/v2\/posts\/371\/revisions"}],"predecessor-version":[{"id":703,"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=\/wp\/v2\/posts\/371\/revisions\/703"}],"wp:attachment":[{"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=371"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=371"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.nubaria.com\/en\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}