Long live entities!

My excitement in the previous post was premature. Yes, I can set the charset parameter of an HTML meta tag to UTF-8.

That will allow me to create a UTF-8 web page and have it be served up and displayed properly.

But what happens when that UTF-8 is inside a URL on that page? What will the user’s browser put into the GET request when it requests the link? In Safari at least, it does a basic RFC 2396 encoding of the UTF-8 encoded characters. Yeah. That doesn’t work. Tell me again why I use version control?
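To make "RFC 2396 encoding of the UTF-8 bytes" concrete, here is a minimal Python sketch. The link path is made up for illustration, and Python's urllib stands in for whatever the browser does internally, so treat it as an approximation of the behavior rather than Safari's actual code.

```python
from urllib.parse import quote, unquote

# Hypothetical link target containing French diacriticals.
link = "/recettes/crème brûlée.html"

# quote() encodes the string as UTF-8, then percent-escapes every byte
# that is not allowed in a URL path ("/" is left alone by default).
escaped = quote(link)
print(escaped)

# Decoding the escapes as UTF-8 recovers the original text.
assert unquote(escaped) == link
```

Each accented character becomes two escaped bytes (é turns into %C3%A9), which only round-trips if the server also assumes UTF-8 when it decodes the request.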

Unfortunately, Safari can’t handle the URLs that are properly encoded using RFC 3987. That’s Safari’s problem, not mine. Curl handles them just fine (as long as the URL is in quotes).

No more entities!

I’ve always been annoyed at trying to write HTML pages in French. I have to keep remembering to convert my diacritical characters (ç, é, etc.) into HTML entities such as &eacute;. I saw this tip on MacOS X Hints and figured my problems were over. It seems the lowly MacOS X TextEdit application can save text files as well-formed HTML. I promptly fired up TextEdit, entered my diacriticals, and checked the result in TextWrangler. Much to my dismay, there were no HTML entities. I replied to the hint on MacOS X Hints and expressed my disappointment. Someone responded to my reply and told me I can just set the charset parameter of my HTML file to UTF-8 and not worry about creating HTML entities. That worked perfectly! I don’t know why I had never put 2 and 2 together.
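For the cases where entities are still wanted, the conversion is easy to automate. This is a minimal Python sketch, not something from the hint; the function name to_entities is made up, and it relies on the standard html.entities table of HTML 4 names.

```python
from html.entities import codepoint2name

def to_entities(text):
    """Replace every non-ASCII character with an HTML entity.

    Named entities are used when HTML 4 defines one, numeric references
    otherwise. ASCII is passed through untouched, so "&" and "<" are NOT
    escaped here; that would be a separate step.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#" + str(cp) + ";")
    return "".join(out)

print(to_entities("Ça, c'est un café."))
```

With a UTF-8 charset declared on the page, none of this is needed, which is the whole point of the tip above.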

I have got to get a better understanding of Unicode and character sets. In the past, I’ve been scared away by such lovely packages as the Text Encoding Manager or its predecessors. Things still aren’t perfect in MacOS X, but they are much better.

MacOS X Pathname Fun

While working on my new program, I discovered some things about Apple’s Core Foundation. I need the ability to take a file system path, convert it to a URL, and back again. To accomplish this, I dutifully wrote a CF::URL C++ wrapper around the Core Foundation CFURLRef type. These URL functions are designed to conform to the URL syntax as defined in RFC 2396. After further review, I have decided that the Core Foundation URL routines are not viable. Here is what I’ve learned:

  • Apple’s CFURL functions seem to provide the opportunity to create a URL that may be relative to a particular URL base. This is exactly what I need. However, I find the function CFURLCreateWithFileSystemPathRelativeToBase very difficult to get working correctly. If you look at one of these URL functions wrong, it returns a null reference.
  • Any time you create a CFURLRef, you have to specify whether the underlying object is a directory or not. In order to do that, I have to convert the path into an FSRef to check it. This seems backward. I have to do more work, and at a low level, in order to represent the path in a URL.
  • Apple provides two key functions for URL management: CFURLCreateStringByReplacingPercentEscapes (to convert escape sequences back into the native form) and CFURLCreateStringByAddingPercentEscapes (to add escape sequences to characters that are invalid). So far, so good. I had a bit of a problem getting them to work. As a last resort, I copied the source to CFURLCreateStringByReplacingPercentEscapes (from the Apple Open Source repository) into my code so I could step through it. When the function hits my test character (é) it properly converts it from the escape sequence into the appropriate Unicode character, then does an additional check. Since the 0x10 bit is not set (!(bytes[0] & 0x10)), it thinks the sequence should have been encoded using three bytes. Then, since the rest of the string is shorter than 9 bytes (?), it fails. The RFC says nothing about 3 (or 9) digit escape sequences.

    I still haven’t given up. Having a correct URL is not critical for my task. I just have to be able to convert from and to a URL-friendly encoding. I had been encoding my own URLs from a Unicode file system path. I switched to using Apple’s function CFURLCreateStringByAddingPercentEscapes. Then I checked the output. Sure enough, every single space character was converted to “%20” as expected. However, it left the (é) character untranslated. Why? The Apple decoder thinks that character should have been decomposed into a 3-character sequence (contrary to RFC 2396), while the encoder doesn’t encode it at all!
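One possible reading of that three-byte check, offered as a guess and not confirmed by anything above: if the test escape was the single byte %E9 (é in Latin-1), a decoder that assumes UTF-8 would see 0xE9 as the lead byte of a three-byte sequence, and three escaped bytes take nine characters (%XX%XX%XX). The Python sketch below only illustrates that arithmetic; utf8_sequence_length is a made-up helper, not Apple's code.

```python
from urllib.parse import quote

def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence starting with lead_byte needs."""
    if lead_byte & 0x80 == 0x00:
        return 1          # 0xxxxxxx: plain ASCII
    if lead_byte & 0xE0 == 0xC0:
        return 2          # 110xxxxx
    if lead_byte & 0xF0 == 0xE0:
        return 3          # 1110xxxx: note that the 0x10 bit is clear
    if lead_byte & 0xF8 == 0xF0:
        return 4          # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")

# Assumption: the test string escaped é as one Latin-1 byte.
latin1_escape = quote("é", encoding="latin-1")   # "%E9"
utf8_escape = quote("é", encoding="utf-8")       # "%C3%A9"

# Read as UTF-8, 0xE9 promises two more bytes that never arrive.
needed = utf8_sequence_length(0xE9)
print(latin1_escape, "needs", needed, "bytes, i.e.", 3 * needed, "escaped characters")
print(utf8_escape, "needs", utf8_sequence_length(0xC3), "bytes")
```

If that guess is right, feeding the decoder %C3%A9 instead of %E9 would be the thing to try.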

My whole reason for doing this was that a MacOS X path can contain Unicode characters. I thought a URL would be a good way to encode this information. I want to keep the paths as native as possible to avoid converting to and from different encodings. But when it is all said and done, I can create an FSRef using the function FSPathMakeRef, which takes a UInt8 * string path, apparently in Unicode using UTF-8. I had wanted to stick to real Unicode strings instead of dealing with UInt8 * strings in UTF-8. It looks like I have no other choice. Apple’s routines cannot process URLs correctly. I have already successfully converted UTF-8 strings to Unicode CFStringRefs and back. I guess I will just use the UTF-8 strings, convert them to CFStringRef types, and then convert that to Unicode. Maybe I should look for a way to directly parse UTF-8 UInt8 * strings and avoid CFStringRef altogether.
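The round trip between a Unicode string and its UTF-8 byte form is short enough to sketch. This is Python, not Core Foundation, and the path and helper names are invented for the example. The normalization lines at the end rest on an assumption worth checking separately: that the file system may hand back accented characters in decomposed form.

```python
import unicodedata

def to_utf8_path(path):
    """Unicode string -> UTF-8 bytes, the shape a UInt8 * path would have."""
    return path.encode("utf-8")

def from_utf8_path(raw):
    """UTF-8 bytes -> Unicode string."""
    return raw.decode("utf-8")

# Hypothetical path with two precomposed é characters.
path = "/Users/moi/Documents/r\u00e9sum\u00e9.txt"
raw = to_utf8_path(path)

# The round trip is lossless, but byte count and character count differ:
# each é occupies two bytes in UTF-8.
assert from_utf8_path(raw) == path
print(len(path), "characters,", len(raw), "bytes")

# A decomposed é (e + combining acute accent) is a different byte sequence,
# so compare paths only after normalizing both sides the same way.
decomposed = unicodedata.normalize("NFD", path)
print(len(to_utf8_path(decomposed)), "bytes when decomposed")
assert unicodedata.normalize("NFC", decomposed) == path
```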

And, as a side note, I have already discovered that I can’t use functions such as CFStringCreateWithFileSystemRepresentation, which would seem to be ideal. First of all, these functions are only in MacOS 10.4. At one point, I thought I would be clever and copy the code (from open source) and use it in a 10.2-compatible program. It had too many dependencies for that, and, in addition, the code was horrible. Further investigation revealed that similar functions have known security bugs. What a mess! From what I can tell, MacOS X uses UTF-8 to encode Unicode file paths into POSIX-friendly versions. I did this and it seems to work fine.

As for URLs, I will roll my own. If I have a character that is greater than 255, I will use Microsoft’s non-standard %uXXXX URL format (for example, %u0041). I can’t find any Microsoft reference for this. All I see are notes about a security hole due to Microsoft’s poor implementation of their own, non-standard protocol. I’ll use their protocol, but I won’t use their bug. If any web browsers connect to my app and get ahold of a Unicode URL, I hope they are using IE.
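A rough Python sketch of what such a roll-your-own scheme could look like. Everything beyond the %uXXXX rule is an assumption made for the example: the function names, the choice of which characters pass through unescaped, the use of %XX for code points up to 255, and the refusal to handle anything above U+FFFF.

```python
import re

# Assumed pass-through set: ASCII letters, digits, and a few path-safe marks.
SAFE = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~/")

def encode_path(path):
    """Escape a Unicode path: %XX up to 255, %uXXXX above that."""
    out = []
    for ch in path:
        cp = ord(ch)
        if ch in SAFE:
            out.append(ch)
        elif cp <= 0xFF:
            out.append("%{:02X}".format(cp))
        elif cp <= 0xFFFF:
            out.append("%u{:04X}".format(cp))
        else:
            # Four hex digits cannot hold these; left out of the sketch.
            raise ValueError("code point above U+FFFF is not supported")
    return "".join(out)

_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")

def decode_path(text):
    """Reverse encode_path. Malformed escapes are left as-is."""
    def replace(match):
        return chr(int(match.group(1) or match.group(2), 16))
    return _ESCAPE.sub(replace, text)

original = "/tmp/café 中.txt"
encoded = encode_path(original)
print(encoded)
assert decode_path(encoded) == original
```

Because "%" itself is escaped as %25 and the decoder never rescans its own output, a literal "%u0041" in a file name survives the round trip instead of turning into "A".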

More Open Source Fun

I haven’t put wget on my system yet. I saw a note on digg about it. Someone commented on how you can use “-c” to resume a previous wget session. I never knew that and had explicitly looked for such an option. Hmm… I must have missed it somehow. Someone else posted a note about CocoaWget, yet another program I wanted to write and never had the time to. Wow. Two useful comments on digg in the same day – a record.

In any event, the developer of CocoaWget didn’t say a word about Universal Binary. But it is just a wrapper around wget, so I thought it would be a good idea to test out my uconfigure and umake scripts.

First of all, I had to change my old uconfigure to uconfigurelib. If, for some reason, I want to make a Universal Binary application like wget, I don’t want it installed in /Programming/Libraries. So I moved that option into uconfigurelib and took it out of uconfigure. Also, trying to run autoreconf returned errors. I guess libevent is the one that needs upgrading. My uconfigure and umake scripts worked perfectly with wget.

Upon further inspection of CocoaWget, it is a Universal Binary, as is its included wget command line tool. All that for naught. Oh well, I got uconfigurelib created and updated uconfigure and learned a bit more about autoconf. And I didn’t waste too much time.

Open Source Universal Binaries

I would like to get Tor running on my MacBook. Unfortunately, the developers haven’t figured out how to build Universal Binaries for MacOS X yet. I’m not waiting for them. I had earlier hacked together an Intel Privoxy, so Tor shouldn’t be too hard.

The Tor configure fails because I don’t have libevent. Libevent looks very interesting. It could be a candidate for my “good” category of OSS. I might want to incorporate libevent into some shareware. I had been just linking all my libraries statically. Now that I have Universal Binaries to worry about, perhaps I should bite the bullet and figure out how to distribute dynamic libraries inside my .app packages. I also don’t want to just install it on my system because I need to test my yet-to-be-developed scheme for distributing dynamic libraries. I think it is a good idea to develop my software in the same environment in which it will be run. Yeah. I’m a wacko. Maybe when all this is done, I’ll fix my hacked-up Privoxy and submit patches to the official Tor and Privoxy people. They seem to need the help.

Luckily, Apple has a nice technote on how to build Universal Binary configure-based open source software. It would be even nicer if it worked. Apple’s instructions will not create Universal Binary dynamic libraries.

This guy seems to know what he’s talking about. He thinks I need to download new versions of autoconf, automake, and libtool. I’m not a Linux guy so I would prefer not to muck around with my system. I keep looking and find this post. He claims to have gotten it working. He also says you can install autoconf, automake, and libtool in /usr/local. OK, I’ll do that.

I download, configure, and install autoconf, automake, and libtool. No problems.

Next, I go into my libevent directory and run:

autoreconf -fvi

This may not be necessary. See the next post.

Now, in theory, make should forward those “-arch” arguments on to the linker. I’m ready to re-configure. I set up some handy aliases in my bash profile for configuring and building Universal Binaries. You have to modify CFLAGS to specify that you want both the ppc and i386 architectures built. You have to add the new isysroot thingy. Finally, you have to disable the default dependency tracking to handle the new split personality nature of the object files. Here are my aliases:

alias uconfigure="env CFLAGS=\"-isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch i386 -arch ppc\" ./configure --prefix=/Programming/Libraries --disable-dependency-tracking"
alias umake="env CFLAGS=\"-isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch i386 -arch ppc\" make"

Note that I am going to install any Universal Binaries into /Programming/Libraries. That way, the system won’t see them. The only way they’ll work is if I figure out how to put them inside a .app bundle. (I hope that works or I’ll have to re-do all this.)
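The same setup can be scripted if aliases get unwieldy. Here is a loose Python sketch of the idea, using the SDK path and flags from the aliases above; the function names are invented, and the optional prefix argument is just one way to separate library builds from application builds.

```python
import os

SDK = "/Developer/SDKs/MacOSX10.4u.sdk"
ARCHS = ("i386", "ppc")

def universal_cflags(sdk=SDK, archs=ARCHS):
    """Build the CFLAGS string: the SDK root plus one -arch per architecture."""
    parts = ["-isysroot", sdk]
    for arch in archs:
        parts += ["-arch", arch]
    return " ".join(parts)

def universal_env(base=None):
    """Copy an environment (os.environ by default) and set CFLAGS in the copy."""
    env = dict(os.environ if base is None else base)
    env["CFLAGS"] = universal_cflags()
    return env

def configure_command(prefix=None):
    """Argument list for ./configure; the prefix is only added when given."""
    cmd = ["./configure"]
    if prefix:
        cmd.append("--prefix=" + prefix)
    cmd.append("--disable-dependency-tracking")
    return cmd

# To actually run it from a source directory:
#   import subprocess
#   subprocess.run(configure_command("/Programming/Libraries"),
#                  env=universal_env(), check=True)
print(universal_cflags())
print(configure_command("/Programming/Libraries"))
```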

This is all similar to Apple’s instructions. Apple also had us munging up the LDFLAGS, turning off optimization, and turning on debug. I had read somewhere (I forget where, hence the need for this blog) that you don’t have to mess with LDFLAGS. Also, I’ll let the makefile handle optimization and debugging.

It all works. I can now build a Universal Binary libevent and Tor. Next, I’ll fix the official Tor and Privoxy packages. That will take more work. Plus I like to run Privoxy all the time. Tor messes with Privoxy’s config file. I need to figure out a way to switch between them. That’s for later.

I can now build Universal Binary dynamic libraries from OSS. I can easily install them into a hidden, but usable location. I’ve learned about libevent. Overall, a productive tangent.
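One way to double-check that a build really came out universal is to look at the file header. The Python sketch below reads the Mach-O "fat" header, which starts with the big-endian magic 0xCAFEBABE, followed by a count and one 20-byte record per architecture. The function name and the small CPU table are my own, and the check is only a heuristic, since Java class files happen to start with the same magic number.

```python
import struct

FAT_MAGIC = 0xCAFEBABE
# Mach-O cputype values for the architectures of interest here.
CPU_NAMES = {7: "i386", 18: "ppc", 0x01000007: "x86_64", 0x01000012: "ppc64"}

def fat_architectures(data):
    """List architecture names found in a fat (universal) header.

    Returns an empty list when the data does not start with the fat magic,
    for example a single-architecture binary.
    """
    if len(data) < 8:
        return []
    magic, count = struct.unpack(">II", data[:8])
    if magic != FAT_MAGIC:
        return []
    names = []
    for index in range(count):
        entry = data[8 + index * 20 : 28 + index * 20]
        if len(entry) < 20:
            break  # truncated header; report what was readable
        cputype, _subtype, _offset, _size, _align = struct.unpack(">IIIII", entry)
        names.append(CPU_NAMES.get(cputype, "cputype {}".format(cputype)))
    return names

# Usage on a real file (the path is hypothetical):
#   with open("/Programming/Libraries/lib/libevent.dylib", "rb") as handle:
#       print(fat_architectures(handle.read(4096)))
```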

PS: Today is 10-05-2006 and the blog has come in handy. I forgot how to check the executable type of a file. The above link to the Apple technote helped, but, for the record, it is:


About this blog

Don’t expect any soul-searching or witty political commentary. This site is a place to upload things I’ve learned about programming so that the blog can remember it for me. Anyone interested in C++, Objective-C++, or Perl programming, mostly on a Macintosh, might find some useful things here too.