A code point iterator adapter for C++ strings in UTF-8

As the last post in this series I’ve been writing on Unicode and UTF-8, I thought I would elaborate on an interesting idea I mentioned in my previous post. When discussing how a std::string object that stores UTF-8 text is just a sequence of raw bytes rather than Unicode code points, I hinted that it wouldn’t be difficult to write a special iterator class for those situations where we may need to traverse the code point values rather than the bytes. In this post I explain how to write such an iterator class.

Note that most of the time, say when we want to iterate through a string to find a particular character or substring, the ordinary std::string::iterator and const_iterator objects are fine. In previous posts I already commented on the interesting properties of UTF-8 that make such operations safe. But there are still times when we may need to check the particular numeric values of the code points that make up a Unicode string. For example, the XML specification restricts the code points that are allowed for many tokens, so a good XML parser would need to check the validity of code points. Or we may want to identify text in a particular language. As an example of this, let’s think up an exercise: how could we check if a UTF-8-encoded string contains any Armenian text? The Unicode character charts show that the range of code points assigned to the Armenian script goes from 0x530 to 0x58F, so we will need a way to extract the numeric values from the UTF-8 text as we traverse it.

What we need for this is an adapter-style iterator class, let’s call it Utf8Iterator, that we can initialize with a std::string::const_iterator so that, on the one hand, each iteration would make the internal iterator advance as many bytes as make up a full character and, on the other hand, dereferencing the iterator would yield a Unicode code point as a 32-bit value. If we manage to write such a class, checking for the appearance of Armenian text would be as easy as pie:

bool TextHasArmenianCharacters(const std::string& text)
{
	bool found = false;
	
	for(Utf8Iterator codePointIterator = text.begin(); !found && codePointIterator != text.end(); ++codePointIterator)
		found = *codePointIterator >= 0x530 && *codePointIterator <= 0x58f;

	return found;
}

1. The Adapter pattern

The class we want should manage the core string iterator object by making it move forward or backward by up to four steps for every code point (a Unicode code point is made up of four bytes at most in the UTF-8 encoding) and provide a dereferencing mechanism that returns a 32-bit value rather than the plain chars that std::string uses. This is an example of the Adapter design pattern. The new class should aggregate the plain string iterator and rely on it for traversing the string, but adapting the semantics and the state of each iteration to the knowledge of Unicode code points, a concept that dumb std::strings are completely ignorant of.

This adapter class will only be used for reading values, not for overwriting the std::string object, so it should work like a const_iterator. We can start writing this adapter as a class that stores a std::string::const_iterator assigned on construction. In the header file we will have:

class Utf8Iterator
{
public:

	Utf8Iterator(std::string::const_iterator it);

	Utf8Iterator(const Utf8Iterator& source);

	Utf8Iterator& operator=(const Utf8Iterator& rhs);

	~Utf8Iterator();

private:

	std::string::const_iterator mStringIterator;
};

The above code is a basic skeleton of a class that uses another class by aggregation and supports the ordinary copy semantics. I have decided not to make the constructor explicit so that std::string::const_iterators can be converted implicitly to the Utf8Iterator type. If the safety of explicit initialisation is preferred, the keyword explicit should be added in front of the constructor. We could also add a default constructor if we wanted to be able to declare these objects before initialising them (I have added such a constructor to my own implementation), but that’s usually not necessary. In the .cpp file we will have:


Utf8Iterator::Utf8Iterator(std::string::const_iterator it) :
mStringIterator(it)
{
}

Utf8Iterator::Utf8Iterator(const Utf8Iterator& source) :
mStringIterator(source.mStringIterator)
{
}

Utf8Iterator& Utf8Iterator::operator=(const Utf8Iterator& rhs)
{
	mStringIterator = rhs.mStringIterator;
	return *this;
}

Utf8Iterator::~Utf8Iterator()
{
}

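Apart from the possibility of adding a default constructor, it may be necessary to add a constructor that takes a std::string::iterator parameter. Without it, initialising a Utf8Iterator from the begin() of a non-const string would need two implicit conversions in a row (iterator to const_iterator, then const_iterator to Utf8Iterator), which the compiler may refuse to chain. That additional constructor, not shown here, should look exactly the same as the one that uses const_iterator.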

2. A bidirectional iterator

Now we want this class to behave as an iterator, but what kind of iterator exactly? The C++ standard library differentiates between several kinds of iterators, depending on which operations are allowed. For example, if it is a std::vector::iterator we can use it + 3 to advance the iterator by three steps, or it - 2 to move backwards in the container by two steps. Provided that we don’t move beyond the valid range (and it is up to our code, not to the compiler, to check that) these expressions can be dereferenced to access the values at those positions. This kind of very versatile iterator is a random-access iterator and it is also the kind of iterator used by std::string. But if it is a std::list::iterator instead, then the compiler will complain with an error when it finds an expression like it + 3. This is because the std::list uses a more limited kind of iterator called ‘bidirectional iterator’, which only allows for one increment or decrement at a time through expressions like ++it or --it. The standard C++ library provides random-access iterators when multiple increments can be carried out in constant time, but not when it is a linear-time operation. Such a design actively discourages iteration idioms that would be inefficient. That’s why the std::list::iterators aren’t random-access, even if it would be straightforward to implement operator+(int) by calling operator++ repeatedly.

Since the std::string uses random-access iterators, we may be tempted to think that Utf8Iterator should also be a random-access iterator. But if we think about this carefully, we’ll find that it is not possible to implement a codepoint-based iterator in such a way that random access is possible in constant time. A std::string is just a sequence of fixed-size elements (bytes), so moving N steps ahead is just a matter of adding N to a pointer, but we can’t advance N code points in a UTF-8 sequence by such simple arithmetic. Because each code point may span between one and four bytes, we need to scan all the bytes in succession and count the steps until we reach N increments. This is a linear-time, not constant-time, operation, just like traversing a linked list. Because of this limitation on random access, it makes sense to treat a codepoint-based iterator as a bidirectional one. Also, I can’t think of any real situation where we would want to move the current iterating position in a UTF-8 sequence by a certain number of code points. So, both in terms of efficiency and usability it makes sense to implement this sort of class as a bidirectional iterator.
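To make the linear cost concrete, here is a minimal sketch of what advancing by N code points involves. The function name AdvanceCodePoints is hypothetical and is not part of the class developed in this post; it assumes well-formed UTF-8 and recognises continuation bytes by their leading ‘10’ bits.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the position n code points after 'it',
// stopping early if 'end' is reached. Assumes well-formed UTF-8.
std::string::const_iterator AdvanceCodePoints(std::string::const_iterator it,
                                              std::string::const_iterator end,
                                              std::size_t n)
{
	while(n > 0 && it != end)
	{
		// Step over the lead byte, then over any continuation bytes (10xxxxxx).
		++it;
		while(it != end && (static_cast<unsigned char>(*it) & 0xC0) == 0x80)
			++it;
		--n;
	}

	return it;
}
```

Every byte between the start and the target position has to be inspected, so the cost grows with N, which is exactly why a random-access interface would be misleading here.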

Since this is a bidirectional iterator, we can ignore those operators that add or subtract integer values and only need to implement the prefix and postfix versions of the ++ and -- operators:


Utf8Iterator& Utf8Iterator::operator++()
{
	//TO DO: Advance mStringIterator by the size in bytes of the current code point.
	return *this;
}

Utf8Iterator Utf8Iterator::operator++(int)
{
	Utf8Iterator temp = *this;
	++(*this);
	return temp;
}

Utf8Iterator& Utf8Iterator::operator--()
{
	//TO DO: Move mStringIterator back by the size in bytes of the preceding code point.
	return *this;
}

Utf8Iterator Utf8Iterator::operator--(int)
{
	Utf8Iterator temp = *this;
	--(*this);
	return temp;
}

The above code contains the trivial part of the implementation and two ‘TO DO’ comments for the part that requires knowledge of the UTF-8 syntax rules. We will come back to that later.

3. The current code point value

An iterator class has a state, which is the current position in the sequence it manages. In the general Iterator pattern as described in the classic 1995 Design Patterns book (see the references below), iterator classes have a CurrentItem method to access the item where the iteration process happens to be. In the C++ standard library, the syntax for iterators is based on pointers (and the most straightforward implementation of a random-access iterator is indeed a pointer), so the current item in the standard C++ iterators is accessed through the dereferencing operator*, which is also the syntactic approach I’ve followed in the initial Armenian problem. So, by implementing the dereferencing operator we can provide a natural way to access the current Unicode code point value. Which type should we use for this value? Before the new C++0x standard, any unsigned integer at least 32 bits in size would do. In the soon-to-be-official C++0x standard, already partially implemented by the main compilers, there is the char32_t type, which is ideal both in terms of size and semantics for this. This is the type I will use for the code point values. If you’re using a compiler that doesn’t support this new standard type, you can substitute any 32-bit unsigned integer type.

So, we can now add operator* to our code. Since this class is only for reading, we only want to implement the const version. In the .cpp file:


char32_t Utf8Iterator::operator*() const
{
	char32_t currentCodePoint = 0;

	//TO DO: Calculate the code point at the position pointed to by mStringIterator. 

	return currentCodePoint;
}

As we did before, we’ll concentrate on the structure of the class first, and leave the nitty-gritty for later.

4. Adding comparison operators

When we use an iterator class we usually have to compare an iterator that is traversing a container with other iterators that point to particular positions in the container, typically the beginning and the end. Such operations require that the comparison operators for equality and difference, == and !=, be defined. We can add these operators to the class in a straightforward way by comparing the internal iterators. In the .cpp file we will have:


bool Utf8Iterator::operator==(const Utf8Iterator& rhs) const
{
	return mStringIterator == rhs.mStringIterator;
}

bool Utf8Iterator::operator!=(const Utf8Iterator& rhs) const
{
	return mStringIterator != rhs.mStringIterator;
}

Since we will often want to compare these code point iterators against the iterators returned by the std::string::begin() and std::string::end() methods, it makes sense to add overloaded versions of these operators in order to avoid relying on too many unnecessary type conversions and temporary objects, like this:


bool Utf8Iterator::operator==(std::string::iterator rhs) const
{
	return mStringIterator == rhs;
}

bool Utf8Iterator::operator==(std::string::const_iterator rhs) const
{
	return mStringIterator == rhs;
}

bool Utf8Iterator::operator!=(std::string::iterator rhs) const
{
	return mStringIterator != rhs;
}

bool Utf8Iterator::operator!=(std::string::const_iterator rhs) const
{
	return mStringIterator != rhs;
}

5. Deriving from std::iterator

The standard C++ library has a std::iterator template that defines some handy typedefs such as value_type. Actually, algorithms that work on iterators should not use these typedefs directly, since pointers are also valid iterators (something like std::vector<int>::iterator::value_type v; can’t possibly compile if std::vector<int>::iterator happens to be an int*). The std::iterator template is however useful because there is another template, std::iterator_traits, which provides the right way for algorithms to access type information from iterators (std::iterator_traits<std::vector<int>::iterator>::value_type v; should always compile), and the primary std::iterator_traits template picks up the typedefs that std::iterator defines if no iterator_traits specialisation is provided. So, in order to save us writing an iterator_traits specialisation, it makes sense to derive the iterator class from std::iterator.

Deriving from std::iterator has the additional advantage of making the purpose of the class clearer to casual readers of the source code, who will identify the Utf8Iterator class as a bidirectional iterator type straight away. So, in the header file we can modify the declaration of the class to add inheritance from std::iterator:

class Utf8Iterator : public std::iterator<std::bidirectional_iterator_tag, char32_t, std::string::difference_type, const char32_t*, const char32_t&>
{
	[...]
};
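Note that std::iterator has been deprecated since C++17. With a newer compiler the same type information can be supplied by declaring the five typedefs directly in the class. The following sketch uses a hypothetical class name, Utf8IteratorTypes, and shows only the typedefs, together with compile-time checks that std::iterator_traits finds them.

```cpp
#include <iterator>
#include <string>
#include <type_traits>

// Hypothetical skeleton: the five nested typedefs that std::iterator would
// otherwise supply. Only the type information is shown here.
class Utf8IteratorTypes
{
public:
	typedef std::bidirectional_iterator_tag iterator_category;
	typedef char32_t value_type;
	typedef std::string::difference_type difference_type;
	typedef const char32_t* pointer;
	typedef const char32_t& reference;
};

// Generic code reads the types through std::iterator_traits, which also
// works for plain pointers thanks to a partial specialisation.
static_assert(std::is_same<std::iterator_traits<Utf8IteratorTypes>::value_type, char32_t>::value,
              "value_type is picked up from the nested typedef");
static_assert(std::is_same<std::iterator_traits<const int*>::value_type, int>::value,
              "pointers are handled by std::iterator_traits itself");
```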

6. Applying the UTF-8 rules

We now have all the scaffolding in place, but we still have to write the non-trivial part of the implementation. This affects three methods: operator* to read the current code point, operator++ to advance the internal iterator by one full code point, and operator-- to move backward by one code point. The encoding rules of UTF-8 are explained in the standard specification for Unicode. Another very good reference is the English Wikipedia article (see the references below).

In order to implement operator++ we need to check the pattern of the byte the internal iterator points to. If the leftmost bit is not set (i.e. it has a value below 128) then it must be a one-byte-long ASCII character. Otherwise, the initial byte must have a bit pattern that may begin with ‘110’, if it’s a two-byte code, ‘1110’, if it’s a three-byte code, or ‘11110’ if it’s a four-byte code. We can implement this in the .cpp file as follows:

const unsigned char kFirstBitMask = 128; // 10000000
const unsigned char kSecondBitMask = 64; // 01000000
const unsigned char kThirdBitMask = 32; // 00100000
const unsigned char kFourthBitMask = 16; // 00010000
const unsigned char kFifthBitMask = 8; // 00001000

[...]

Utf8Iterator& Utf8Iterator::operator++()
{
	char firstByte = *mStringIterator;

	std::string::difference_type offset = 1;

	if(firstByte & kFirstBitMask) // This means the first byte has a value greater than 127, and so is beyond the ASCII range.
	{
		if(firstByte & kThirdBitMask) // In a valid lead byte this means a value of 224 or more, and so it must be at least a three-octet code point.
		{
			if(firstByte & kFourthBitMask) // This means that the first byte has a value of 240 or more, and so it must be a four-octet code point.
				offset = 4;
			else
				offset = 3;
		}
		else
		{
			offset = 2;
		}
	}

	mStringIterator += offset;

	return *this;
}

To keep things simple, I have omitted checks for invalid UTF-8 syntax in the code. In production code, we would want to throw exceptions if an invalid UTF-8 sequence is found.
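The lead-byte test can also be tried in isolation. The following sketch mirrors the branches of operator++ in a free function; the name Utf8SequenceLength is hypothetical and, like the code above, the function assumes a valid lead byte and does no error checking.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the tests in operator++: returns how many
// bytes the code point starting with 'leadByte' occupies.
// Assumes a valid UTF-8 lead byte; no error checking is done.
std::size_t Utf8SequenceLength(unsigned char leadByte)
{
	if((leadByte & 0x80) == 0)
		return 1; // 0xxxxxxx: ASCII
	if((leadByte & 0x20) == 0)
		return 2; // 110xxxxx
	if((leadByte & 0x10) == 0)
		return 3; // 1110xxxx
	return 4;     // 11110xxx
}
```

For instance, the Armenian capital letter Ayb (U+0531) is encoded as the bytes D4 B1, and the lead byte 0xD4 (11010100) gives a length of 2, while the euro sign (E2 82 AC) has the lead byte 0xE2 (11100010) and gives a length of 3.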

Now let’s see how to write operator--. We will have to decrement the internal iterator and first check whether it is an ASCII value (an unset leftmost bit). If that’s the case, then the decremented internal iterator is already pointing to the previous code point. Otherwise we’ll have to decrement the internal iterator again up to three times until we find a bit pattern with the two leftmost bits set. Again, I omit any error-checking for ill-formed UTF-8:

Utf8Iterator& Utf8Iterator::operator--()
{
	--mStringIterator;

	if(*mStringIterator & kFirstBitMask) // This means that the previous byte is not an ASCII character.
	{
		--mStringIterator;
		if((*mStringIterator & kSecondBitMask) == 0)
		{
			--mStringIterator;
			if((*mStringIterator & kSecondBitMask) == 0)
			{
				--mStringIterator;
			}
		}
	}
	
	return *this;
}
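The same backward step can be written as a loop instead of nested ifs. This is an alternative formulation, not the code used in the class: it relies on the fact that continuation bytes always have the bit pattern ‘10xxxxxx’, so we keep stepping back while we see one. The function name PreviousCodePointStart is hypothetical, and well-formed UTF-8 is assumed.

```cpp
#include <string>

// Hypothetical helper: moves 'it' back to the first byte of the preceding
// code point. Assumes well-formed UTF-8 and that 'it' is not at 'begin'.
std::string::const_iterator PreviousCodePointStart(std::string::const_iterator it,
                                                   std::string::const_iterator begin)
{
	do
	{
		--it;
	}
	while(it != begin && (static_cast<unsigned char>(*it) & 0xC0) == 0x80);

	return it;
}
```

The check against begin is an extra safeguard that the nested version does not have; it stops the loop from running off the front of the string if the text starts with stray continuation bytes.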

And now there’s only one method left, the dereferencing operator, which will have to make use of the UTF-8 rules to compose the code point value out of the bytes that make up the character. This is done as follows (excuse the magic numbers; feel free to take them out as constants if you want to use this code):

char32_t Utf8Iterator::operator*() const
{
	char32_t codePoint = 0;

	char firstByte = *mStringIterator;

	if(firstByte & kFirstBitMask) // This means the first byte has a value greater than 127, and so is beyond the ASCII range.
	{
		if(firstByte & kThirdBitMask) // In a valid lead byte this means a value of 224 or more, and so it must be at least a three-octet code point.
		{
			if(firstByte & kFourthBitMask) // This means that the first byte has a value of 240 or more, and so it must be a four-octet code point.
			{
				codePoint = (firstByte & 0x07) << 18;
				char secondByte = *(mStringIterator + 1);
				codePoint +=  (secondByte & 0x3f) << 12;
				char thirdByte = *(mStringIterator + 2);
				codePoint += (thirdByte & 0x3f) << 6;
				char fourthByte = *(mStringIterator + 3);
				codePoint += (fourthByte & 0x3f);
			}
			else
			{
				codePoint = (firstByte & 0x0f) << 12;
				char secondByte = *(mStringIterator + 1);
				codePoint += (secondByte & 0x3f) << 6;
				char thirdByte = *(mStringIterator + 2);
				codePoint +=  (thirdByte & 0x3f);
			}
		}
		else
		{
			codePoint = (firstByte & 0x1f) << 6;
			char secondByte = *(mStringIterator + 1);
			codePoint +=  (secondByte & 0x3f);
		}
	}
	else
	{
		codePoint = firstByte;
	}

	return codePoint;
}

And that’s it. We can now test the TextHasArmenianCharacters function in the exercise we suggested at the beginning, and it should only return true if the supplied text contains any Armenian characters.
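For readers who want to try this out quickly, here is a condensed, self-contained sketch that folds the pieces above into a single inline class. It keeps only what the Armenian loop needs (construction, pre-increment, dereference and comparison with a const_iterator), writes the bit masks as literals, decodes the continuation bytes in a loop rather than in separate branches, and assumes well-formed UTF-8.

```cpp
#include <string>

// Condensed sketch of the class developed above: forward iteration,
// dereference and comparison only. Assumes well-formed UTF-8.
class Utf8Iterator
{
public:
	Utf8Iterator(std::string::const_iterator it) : mStringIterator(it) {}

	Utf8Iterator& operator++()
	{
		const unsigned char firstByte = static_cast<unsigned char>(*mStringIterator);
		std::string::difference_type offset = 1;
		if(firstByte & 0x80)
			offset = (firstByte & 0x20) ? ((firstByte & 0x10) ? 4 : 3) : 2;
		mStringIterator += offset;
		return *this;
	}

	char32_t operator*() const
	{
		const unsigned char firstByte = static_cast<unsigned char>(*mStringIterator);
		if((firstByte & 0x80) == 0)
			return firstByte;

		// Number of continuation bytes and the payload bits of the lead byte.
		int extra = 1;
		char32_t codePoint = firstByte & 0x1f;
		if(firstByte & 0x20)
		{
			extra = (firstByte & 0x10) ? 3 : 2;
			codePoint = (firstByte & 0x10) ? (firstByte & 0x07) : (firstByte & 0x0f);
		}

		// Each continuation byte contributes its six low bits.
		for(int i = 1; i <= extra; ++i)
			codePoint = (codePoint << 6) | (static_cast<unsigned char>(*(mStringIterator + i)) & 0x3f);

		return codePoint;
	}

	bool operator!=(std::string::const_iterator rhs) const { return mStringIterator != rhs; }

private:
	std::string::const_iterator mStringIterator;
};

bool TextHasArmenianCharacters(const std::string& text)
{
	bool found = false;

	for(Utf8Iterator codePointIterator = text.begin(); !found && codePointIterator != text.end(); ++codePointIterator)
		found = *codePointIterator >= 0x530 && *codePointIterator <= 0x58f;

	return found;
}
```

A string containing the bytes D5 80 D5 A1 D5 B5 (the Armenian letters U+0540, U+0561 and U+0575) should make the function return true, whereas text made up only of Latin letters, accented characters or the euro sign should make it return false.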

7. An optimisation: caching the current value

The implementation we have for the Utf8Iterator class should work correctly, but it has a performance drawback that we can avoid. Basically, if we have declared a Utf8Iterator it, then every time we dereference it (*it), the value will be recalculated. Since it is common to dereference an iterator more than once within the same iteration step (the TextHasArmenianCharacters example does it twice), it is better in terms of efficiency to cache the code point value, so that it need not be recalculated in multiple dereferences. This can be done by adding a new member to the class: char32_t mCurrentCodePoint. The code that we’ve written for operator* can then be moved to a utility private method CalculateCurrentCodePoint. We might be tempted to call this method from the increment and decrement operators, but there is a problem with that, which is that an iterator is not always dereferenceable. In particular, mStringIterator may be pointing to the end position of a string. That’s why typical iterating code usually compares iterator values with the one returned by the std::string::end method before attempting to dereference it. Because of this behaviour of iterators, we can only dereference the internal mStringIterator when the code point iterator itself is being dereferenced.

But then we need a way to know whether the current code point value needs recalculation or not. We can do this through a ‘dirty’ flag that must be set to true whenever the code point value needs to be reevaluated. We will simply need to set it to true on construction and whenever the iterator is incremented or decremented. CalculateCurrentCodePoint will return straight away if the ‘dirty’ flag is false; if it is true, it will recalculate the code point and reset the flag to false. Both the cached value and the ‘dirty’ flag need to be declared as mutable in C++ in order to keep the dereferencing operator a const method, and CalculateCurrentCodePoint itself must be const so that operator* can call it.

So, in the header file we will have to extend the class definition with two additional member variables and the private utility method:

	[...]
	mutable char32_t mCurrentCodePoint;
	mutable bool mDirty;

	void CalculateCurrentCodePoint() const;

In the implementation file, we will have to initialise the new members in the constructors and in the assignment operator:

Utf8Iterator::Utf8Iterator(std::string::const_iterator it) :
mStringIterator(it),
mCurrentCodePoint(0),
mDirty(true)
{
}

Utf8Iterator::Utf8Iterator(const Utf8Iterator& source) :
mStringIterator(source.mStringIterator),
mCurrentCodePoint(source.mCurrentCodePoint),
mDirty(source.mDirty)
{
}

Utf8Iterator& Utf8Iterator::operator=(const Utf8Iterator& rhs)
{
	mStringIterator = rhs.mStringIterator;
	mCurrentCodePoint = rhs.mCurrentCodePoint;
	mDirty = rhs.mDirty;

	return *this;
}

The increment and decrement operators will have to be modified so that they set the mDirty flag to true:

Utf8Iterator& Utf8Iterator::operator++()
{
	[...]

	mDirty = true;

	return *this;
}

Utf8Iterator& Utf8Iterator::operator--()
{
	[...]

	mDirty = true;

	return *this;
}

The dereference operator can now be implemented in terms of the private method CalculateCurrentCodePoint:

Utf8Iterator::reference Utf8Iterator::operator*() const
{
	CalculateCurrentCodePoint();

	return mCurrentCodePoint;
}

Finally, the code that has been taken out of operator* must be moved to the implementation of CalculateCurrentCodePoint:

void Utf8Iterator::CalculateCurrentCodePoint() const
{
	if(mDirty)
	{
		[...]

		mDirty = false;
	}
}
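To see the caching scheme at work in isolation, here is a stripped-down illustration. The class name CachingCursor and its Next, Current and DecodeCount methods are hypothetical and exist only for this sketch; to stay short it handles one- and two-byte sequences only, and it counts how many times the decoding really runs.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, stripped-down illustration of the 'dirty' flag scheme.
// Handles one- and two-byte UTF-8 sequences only.
class CachingCursor
{
public:
	explicit CachingCursor(std::string::const_iterator it) :
	mStringIterator(it), mCurrentCodePoint(0), mDirty(true), mDecodeCount(0)
	{
	}

	void Next()
	{
		mStringIterator += (static_cast<unsigned char>(*mStringIterator) & 0x80) ? 2 : 1;
		mDirty = true; // The cached value belongs to the old position.
	}

	char32_t Current() const
	{
		if(mDirty)
		{
			const unsigned char firstByte = static_cast<unsigned char>(*mStringIterator);
			if(firstByte & 0x80)
				mCurrentCodePoint = ((firstByte & 0x1f) << 6) | (static_cast<unsigned char>(*(mStringIterator + 1)) & 0x3f);
			else
				mCurrentCodePoint = firstByte;
			mDirty = false;
			++mDecodeCount;
		}
		return mCurrentCodePoint;
	}

	std::size_t DecodeCount() const { return mDecodeCount; }

private:
	std::string::const_iterator mStringIterator;
	mutable char32_t mCurrentCodePoint;
	mutable bool mDirty;
	mutable std::size_t mDecodeCount;
};
```

Calling Current() twice at the same position decodes the bytes once; only a call to Next() marks the cached value as stale again.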

8. Adding error checks

The code we have written assumes that the iterator will have to deal with well-formed UTF-8. In production code, it is advisable to check whether the bit patterns adhere to the UTF-8 rules, and throw exceptions whenever any ill-formed UTF-8 is found.
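As a rough idea of what such checks might look like, the following sketch validates the structure of one encoded code point before it is used. The function name CheckedSequenceLength and the choice of std::invalid_argument and std::out_of_range are assumptions made for this example. It checks the bit patterns only; overlong forms, encoded surrogates and values above 0x10ffff would need further tests.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical checked helper: returns the byte length of the code point
// starting at 'it', or throws if the bytes do not follow the UTF-8 bit
// patterns. Structural checks only.
std::string::difference_type CheckedSequenceLength(std::string::const_iterator it,
                                                   std::string::const_iterator end)
{
	if(it == end)
		throw std::out_of_range("no code point at the end position");

	const unsigned char leadByte = static_cast<unsigned char>(*it);
	std::string::difference_type length = 0;

	if((leadByte & 0x80) == 0x00)      length = 1; // 0xxxxxxx
	else if((leadByte & 0xe0) == 0xc0) length = 2; // 110xxxxx
	else if((leadByte & 0xf0) == 0xe0) length = 3; // 1110xxxx
	else if((leadByte & 0xf8) == 0xf0) length = 4; // 11110xxx
	else
		throw std::invalid_argument("invalid UTF-8 lead byte");

	if(end - it < length)
		throw std::invalid_argument("truncated UTF-8 sequence");

	// Every byte after the lead byte must look like 10xxxxxx.
	for(std::string::difference_type i = 1; i < length; ++i)
	{
		if((static_cast<unsigned char>(*(it + i)) & 0xc0) != 0x80)
			throw std::invalid_argument("invalid UTF-8 continuation byte");
	}

	return length;
}
```

A function like this could be called from the increment and dereferencing code so that both reject ill-formed input in the same way.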

9. Writing a similar class for UTF-16

Following the same line of reasoning, it is possible to write a Utf16Iterator class. In fact, Utf16Iterator is even easier to write since the UTF-16 algorithm is simpler. We would have to choose the character and string types that would replace char and std::string for UTF-16 characters. In the newer C++0x standard it is natural to use char16_t and std::u16string, but when programming for Windows with Visual Studio, and until there is better support for the C++0x types, we would have to use wchar_t and std::wstring.
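As a sketch of the decoding step such a class would need, the following function reads one code point from a std::u16string. The name DecodeUtf16 is hypothetical, and the function assumes well-formed UTF-16, that is, every high surrogate is followed by a low surrogate.

```cpp
#include <string>

// Hypothetical helper: decodes the code point that starts at 'it' in a
// UTF-16 string. Assumes well-formed UTF-16.
char32_t DecodeUtf16(std::u16string::const_iterator it)
{
	const char32_t first = *it;

	// High surrogates occupy the range 0xD800-0xDBFF.
	if(first >= 0xD800 && first <= 0xDBFF)
	{
		const char32_t second = *(it + 1); // Low surrogate, 0xDC00-0xDFFF.
		return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
	}

	return first; // A single 16-bit unit in the Basic Multilingual Plane.
}
```

The increment operator of a Utf16Iterator would follow the same test: advance by two units if the current unit is a high surrogate, and by one otherwise.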

10. References

  1. Design Patterns. Elements of Reusable Object-Oriented Software. Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides. Addison-Wesley 1995.
  2. The Unicode Standard. Version 6.0 – Core Specification. Standard specification. Refer to pages 93 and 94 for the rules of UTF-16 and UTF-8.
  3. UTF-8. Article on English Wikipedia.
  4. UTF-16. Article on English Wikipedia.
  5. C++ and STL: Take Advantage of STL Algorithms by Implementing a Custom Iterator. An article on custom iterators by Samir Bajaj on MSDN magazine.
  6. iterator_traits. A discussion about std::iterator_traits on the comp.lang.c++ newsgroup.
  7. Help on iterator_traits?. Another discussion about std::iterator_traits on the comp.lang.c++ newsgroup.

9 Responses to A code point iterator adapter for C++ strings in UTF-8

  1. Chet says:

    I just saw this post before I left the office. This is what I’ve been looking for for a while now.

    I’ll read it in depth tomorrow. I’m having some issues with displaying Japanese characters. Well most of them work fine. It turns out there’s a type of characters named a ‘surrogate pair’ which I believe take two wchar_t to fill. This is messing up the output. :S

    • Ángel José Riesgo says:

      Note: My original reply, like some of the posts in this thread, got accidentally deleted while removing spam messages a few months ago. I have manually restored the affected messages but I haven’t been able to recover my replies. As far as I can remember the gist of my reply to Chet was more or less as follows:

      Surrogate pairs only affect extremely rare characters that didn’t make it into the Unicode Basic Multilingual Plane. So, if you’re processing normal Japanese text, surrogate pairs should not appear at all. Only if the Japanese text you have contains dialectal or ancient characters that aren’t part of the standard modern language could you bump into surrogate pairs. Could you give any examples of those characters that are giving you trouble?

  2. Andreas says:

    Hello, great article!
    I am working on a hobby project and I am trying to be able to enumerate through all possible utf-8 characters.
    I would like to be able to iterate through all of the utf-8 characters in a string, but I am not sure how to iterate over all the code points.
    Could you give me a pointer on start and end conditions that allow me to enumerate with control?

  3. Andreas says:

    omg sorry, I meant:

    I would like to be able to iterate through all of the utf-8 characters, but I am not sure how to iterate over all the code points.
    Could you give me a pointer on start and end conditions that allow me to enumerate with control?

    • Ángel José Riesgo says:

      Apologies for the late reply. I’ve been very busy during the last few weeks and I’m finding it hard to check and update this blog regularly.

      In Unicode valid code points are any integer numbers up to 0x10ffff (1114111). Since the [0,31] range in the ASCII subset is made up of non-printable characters, this leaves a valid range of [32, 1114111]. So, if you can use the UTF-32 encoding where every character is coded just with its integer value, iterating through all the numbers would yield all Unicode characters. So, if you have a Dump function which does whatever you want to do with each character you would do:

      for(char32_t c = 0x20; c < 0x110000; ++c)
      	Dump(c);

      Now if you want to get the UTF-8 (or UTF-16) representation of each code point, things get trickier. You will need a function that converts the code point into a sequence of bytes (or 16-bit units). Let’s call this function CodePointToUtf8. I’m working on my old laptop now and I haven’t been able to test it, but I think in C++ 11, the following definition should work:

      #include <string>
      #include <locale>
      #include <codecvt>

      std::string CodePointToUtf8(char32_t codePoint)
      {
      	std::string utf8Bytes;

      	std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
      	std::u32string utf32String(1, codePoint);
      	utf8Bytes = convert.to_bytes(utf32String);

      	return utf8Bytes;
      }

      The UTF-16 version would be similar. Now if you’re using an older C++ compiler that doesn’t support wstring_convert and codecvt_utf8, like the Visual Studio 2010 I have on this laptop, then you will need to generate the byte sequences more laboriously by working through the UTF-8 specification. The following function I wrote a few years back should work, I think:

      #include <cassert>
      #include <string>

      std::string CodePointToUtf8(char32_t codePoint)
      {
      	assert(codePoint < 0x110000);

      	std::string utf8Bytes;

      	if(codePoint < 0x80)
      	{
      		char c = static_cast<char>(codePoint);
      		utf8Bytes.push_back(c);
      	}
      	else if(codePoint < 0x800)
      	{
      		// The code point has the binary format 00000yyy yyxxxxxx.
      		// It must be turned into two bytes: (110yyyyy, 10xxxxxx)
      		utf8Bytes.reserve(2);
      		char c1 = (0xc0 | (codePoint >> 6)); // This is 110yyyyy
      		utf8Bytes.push_back(c1);
      		char c2 = ((codePoint) | (1 << 7)) & ~(1 << 6); // This is 10xxxxxx
      		utf8Bytes.push_back(c2);
      	}
      	else if(codePoint < 0x10000)
      	{
      		// The code point has the binary format zzzzyyyy yyxxxxxx.
      		// It must be turned into three bytes: (1110zzzz, 10yyyyyy, 10xxxxxx)
      		utf8Bytes.reserve(3);
      		char c1 = (0xe0 | (codePoint >> 12)); // This is 1110zzzz
      		utf8Bytes.push_back(c1);
      		char c2 = ((codePoint >> 6) | (1 << 7)) & ~(1 << 6); // This is 10yyyyyy
      		utf8Bytes.push_back(c2);
      		char c3 = ((codePoint) | (1 << 7)) & ~(1 << 6); // This is 10xxxxxx
      		utf8Bytes.push_back(c3);
      	}
      	else if(codePoint < 0x110000)
      	{
      		// The code point has the binary format 000wwwzz zzzzyyyy yyxxxxxx.
      		// It must be turned into four bytes: (11110www, 10zzzzzz, 10yyyyyy, 10xxxxxx)
      		utf8Bytes.reserve(4);
      		char c1 = (0xf0 | (codePoint >> 18)); // This is 11110www
      		utf8Bytes.push_back(c1);
      		char c2 = ((codePoint >> 12) | (1 << 7)) & ~(1 << 6); // This is 10zzzzzz
      		utf8Bytes.push_back(c2);
      		char c3 = ((codePoint >> 6) | (1 << 7)) & ~(1 << 6); // This is 10yyyyyy
      		utf8Bytes.push_back(c3);
      		char c4 = ((codePoint) | (1 << 7)) & ~(1 << 6); // This is 10xxxxxx
      		utf8Bytes.push_back(c4);
      	}

      	return utf8Bytes;
      }

  4. panqnik says:

    Hey there,
    Very useful post! I would like to invite you to visit my blog as well, and read my latest post about sequence points in C and C++.

    http://blog.panqnik.pl/dev/sequence-points-in-c-cpp/

    Best regards,
    panqnik

  5. This is awesome. Thanks! I have a question about how can one extract a sub-string? For example like:


    std::string word("...............");
    Utf8Iterator it = word.begin();
    ++it; ++it; ++it;
    std::string subword( word.begin(), it );

    so that subword holds the first 3 code points of word?

    • To answer my own question it looks like this will work:


      std::string subword( Utf8Iterator( word.begin() ), it );

      • Ángel José Riesgo says:

        Hi Stephen. That line does compile, but I don’t think it works. Unless there’s something I’m missing, I expect it will build the ‘subword’ string by appending to it the result of casting each 32-bit code point value to a truncated 8-bit char, so as soon as you come across any non-ASCII character you’d get rubbish rather than the expected Unicode text.

        I think you will need to define a vector of code points and then use the function CodePointToUtf8 that I mentioned in a comment just above yours. In easy-to-read code, it would be something like this:

        std::vector<char32_t> codePoints( Utf8Iterator( word.begin() ), it );
        std::string subword;
        
        for (char32_t codePoint : codePoints)
        {
        	subword += CodePointToUtf8(codePoint);
        }
        
