author: Joey Hess <joey@kodama.kitenet.net> 2008-09-25 18:26:42 -0400
committer: Joey Hess <joey@kodama.kitenet.net> 2008-09-25 18:26:42 -0400
commit: 965f7310fec535d17e04464019de4ca75fa0afec
tree: bea5042f5c6511f23580e657daee516b878f2007 /doc/bugs
parent: a4215312d0564e335b16ec6f02e90907426e3771
git: Fix handling of utf-8 filenames in recentchanges.
Seems that the problem is that once the \nnn coming from git is converted to a single character, decode_utf8 decides that this is a standalone character, and not part of a multibyte utf-8 sequence, and so does nothing. I tried playing with the utf-8 flag, but that didn't work. Instead, use decode("utf8"), which doesn't have the same qualms, and successfully decodes the octets into a utf-8 character.

Rant:

Think for a minute about the fact that any and every program that parses git-log, or git-show, etc. output to figure out what files were in a commit needs to contain this snippet of code, to convert from git-log's wacky output to a regular character set:

    if ($file =~ m/^"(.*)"$/) {
        ($file=$1) =~ s/\\([0-7]{1,3})/chr(oct($1))/eg;
    }

(And it's only that "simple" if you don't care about filenames with embedded \n or \t or other control characters.)

Does that strike anyone else as putting the parsing and conversion in the wrong place (i.e., in gitweb, ikiwiki, etc., etc.)? Doesn't anyone who actually uses git with utf-8 filenames get a bit pissed off at seeing \xxx\xxx instead of the utf-8 in git-commit and other output?
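For illustration, the two steps described in the message (undo git's octal quoting, then decode the resulting octets) can be put together roughly as below. This is a minimal sketch, not the code from this commit: the helper name `unquote_git_path` and the sample path are assumptions made for the example, and like the snippet above it ignores `\n`, `\t` and other non-octal escapes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (the name is an assumption, not ikiwiki's own):
# turn one path as printed by git-log into a Perl character string.
sub unquote_git_path {
    my ($file) = @_;

    # A quoted path has each unusual byte written as a backslash
    # followed by up to three octal digits; turn each back into one octet.
    if ($file =~ m/^"(.*)"$/) {
        ($file = $1) =~ s/\\([0-7]{1,3})/chr(oct($1))/eg;
    }

    # $file now holds raw octets; decode them as UTF-8 into characters.
    return decode("utf8", $file);
}

# Sample input: \303\266 are the two UTF-8 octets of "ö" (U+00F6).
my $path = unquote_git_path('"german\303\266umlaut.mdwn"');

binmode(STDOUT, ":encoding(UTF-8)");
print "$path\n";
```

An unquoted, plain ASCII path passes through the same function unchanged, so a caller would not need to special-case it.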
Diffstat (limited to 'doc/bugs')
-rw-r--r-- doc/bugs/unicode_encoded_urls_and_recentchanges.mdwn 8
1 files changed, 6 insertions, 2 deletions
diff --git a/doc/bugs/unicode_encoded_urls_and_recentchanges.mdwn b/doc/bugs/unicode_encoded_urls_and_recentchanges.mdwn
index 9184a6b84..2568aef38 100644
--- a/doc/bugs/unicode_encoded_urls_and_recentchanges.mdwn
+++ b/doc/bugs/unicode_encoded_urls_and_recentchanges.mdwn
@@ -1,9 +1,13 @@
it appears that unicode characters in the title that are unicode letters are spared the __ filename encoding but instead saved in their utf8 encoding. (correct me if i'm wrong; didn't find the code that does this.) -- see below for examples.
+> Filenames can have any alphanumerics in them without the __ escaping.
+> Your locale determines whether various unicode characters are considered
+> alphanumeric. In other words, it just looks at the [[:alpha:]] character
+> class, whatever your locale defines it to be. --[[Joey]]
+
this is not a problem per se, but (at least with git backend) the recent changes missinterpret the file name character set (it seems to read the filenames as latin1) and both display wrong titles and create broken links.
the problem can be shown with an auto-setup'd ikiwiki without cgi when manually creating utf8 encoded filenames and running ikiwiki with LANG=en_GB.UTF-8 .
-----
+> Encoding issue, I figured out a fix. [[done]] --[[Joey]]
-[[right→link]], [[germanöumlaut]], [[smalµmu]], [[cyrillicԀletter]], [[tibetan༆something]], [[devanagariऔletter]], [[subscript₉nine]], [[subscriptₐa]]