How to convert an emoticon specified by a U+xxxxx code to utf-8?
Emoticons seem to be specified using a format of U+xxxxx, wherein each x is a hexadecimal digit.
For example, U+1F615 is the official Unicode Consortium code for the "confused face" 😕
As I am often confused, I have a strong affinity for this symbol.
The U+1F615 representation is confusing to me because I thought the only encodings possible for Unicode characters required 8, 16, 24 or 32 bits, whereas 5 hex digits require 5x4=20 bits.
I've discovered that this symbol seems to be represented by a completely different hex string in bash:
$ echo -n 😕 | hexdump
0000000 f0 9f 98 95
0000004
$ echo -e "\xf0\x9f\x98\x95"
😕
$ PS1=$'\xf0\x9f\x98\x95 >'
😕 >
I would have expected U+1F615 to convert to something like \x00 \x01 \xF6 \x15.
I don't see the relationship between these two encodings.
When I look up a symbol in the official Unicode Consortium list, I would like to be able to use that code directly without having to convert it manually in this tedious fashion, i.e.
- finding the symbol on some web page
- copying it to the clipboard of the web browser
- pasting it in bash to echo through a hexdump to discover the REAL code.
Can I use this 20-bit code to determine what the 32-bit code is?
Does a relationship exist between these 2 numbers?
shell character-encoding unicode
edited Jan 19 '16 at 12:59
tarleb
asked Dec 30 '15 at 6:33
Alex Ryan
3 Answers
UTF-8 is a variable length encoding of Unicode. It is designed to be a superset of ASCII. See Wikipedia for details of the encoding. \x00 \x01 \xF6 \x15 would be UCS-4BE or UTF-32BE encoding.
To get from the Unicode code point to the UTF-8 encoding, assuming the locale's charmap is UTF-8 (see the output of locale charmap), it's just:
$ printf '\U1F615\n'
😕
$ echo -e '\U1F615'
😕
$ confused_face=$'\U1F615'
The latter will be in the next version of the POSIX standard.
AFAIK, that syntax was introduced in 2000 by the stand-alone GNU printf utility (as opposed to the printf utility of the GNU shell), brought to echo/printf/$'...' builtins first by zsh in 2003, ksh93 in 2004, bash in 2010, but was obviously inspired by other languages.
ksh93 also supports it as printf '\x1f615\n' and printf '\u1f615\n'.
$'\uXXXX' and $'\UXXXXXXXX' are supported by zsh, bash, ksh93, mksh and FreeBSD sh, GNU printf, GNU echo.
Some require all the digits (as in \U0001F615 as opposed to \U1F615) though that's likely to change in future versions as POSIX will allow fewer digits. In any case, you need all the digits if the \UXXXXXXXX is to be followed by hexadecimal digits as in \U0001F615FOX, as \U1F615FOX would have been $'\U001F615F'OX.
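As a small sketch of the "use all the digits" advice, the zero-padding can be left to printf itself. The function name pad_codepoint is made up for this illustration; it only builds the text of the escape, so it depends neither on the locale nor on the shell understanding \U:

```shell
# Hypothetical helper: turn a bare hex code point (e.g. 1F615) into an
# eight-digit \UXXXXXXXX escape, so that hex digits following it cannot
# be absorbed into the escape sequence.
pad_codepoint() {
    printf '\\U%08X\n' "0x$1"
}

pad_codepoint 1F615    # prints \U0001F615
```

In a shell whose printf understands \U, and in a UTF-8 locale, something like printf "$(pad_codepoint 1F615)FOX\n" should then give the face followed by FOX, though that part is subject to the shell differences described above.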
Some expand to the characters in the current locale's encoding at the time the string is parsed or at the time it is expanded, some only in UTF-8 regardless of the locale. If the character is not available in the current locale's encoding, the behaviour varies between shells.
So, for best portability, use it only in UTF-8 locales, use all the digits, and use it in $'...':
printf '%s\n' $'\U0001F615'
Note that:
LC_ALL=C.UTF-8; printf '%s\n' $'\U0001F615'
or:
{
  LC_ALL=C.UTF-8
  printf '%s\n' $'\U0001F615'
}
will not work with all shells (including bash) because the $'\U0001F615' is parsed before LC_ALL is assigned. (Also note that there's no guarantee that a system will have a locale called C.UTF-8.)
You'd need:
LC_ALL=C.UTF-8; eval "confused_face=$'\U0001F615'"
Or:
LC_ALL=C.UTF-8
printf '%s\n' $'\U0001F615'
(not within a compound command or function).
For the reverse, to get from the UTF-8 encoding to the Unicode code-point, see this other question or that one.
$ unicode 😕
U+1F615 CONFUSED FACE
UTF-8: f0 9f 98 95 UTF-16BE: d83dde15 Decimal: &#128533;
😕
Category: So (Symbol, Other)
Bidi: ON (Other Neutrals)
$ perl -CA -le 'printf "%x\n", ord shift' 😕
1f615
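For the specific case of a four-byte sequence like the one above, the reverse can also be sketched with shell arithmetic alone: strip the marker bits from each byte and concatenate what is left. The function name is invented here, and it does no validation of its input:

```shell
# Hypothetical helper: four UTF-8 bytes (hex, no prefix) -> code point.
# The lead byte 11110xxx carries 3 payload bits; each continuation
# byte 10xxxxxx carries 6 payload bits.
utf8_4byte_to_codepoint() (
    b1=$(( 0x$1 & 0x07 ))
    b2=$(( 0x$2 & 0x3F ))
    b3=$(( 0x$3 & 0x3F ))
    b4=$(( 0x$4 & 0x3F ))
    printf 'U+%X\n' $(( (b1 << 18) | (b2 << 12) | (b3 << 6) | b4 ))
)

utf8_4byte_to_codepoint f0 9f 98 95    # prints U+1F615
```

The body is wrapped in ( ) rather than { } only so that the temporary variables do not leak into the calling shell.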
edited yesterday
answered Dec 30 '15 at 7:59
Stéphane Chazelas
Notice that if \U1F615 is followed by another valid hexadecimal digit then that will be assumed to be part of the escape sequence. To make it work regardless of what it is followed by it has to have enough leading zeros to be exactly eight digits long: \U0001F615
– kasperd
Dec 30 '15 at 9:18
@kasperd, thanks. Yes, it's worth noting. I've included that in the answer.
– Stéphane Chazelas
Dec 30 '15 at 9:49
Here's a way to convert from UTF-32 (big endian) to UTF-8:
$ confused=$(echo -ne "\x0\x01\xF6\x15" | iconv -f UTF-32BE -t UTF-8)
$ echo "$confused"
😕
You'll notice your hex value 0x01F615 in there, padded with an extra leading 0 to fill 32 bits.
The Wikipedia page on UTF-8 explains the transformation from a Unicode codepoint to its UTF-8 representation very clearly. But trying to do it yourself in shell scripting might not be the best idea.
UTF-32 is fixed-width, and the correspondence between codepoint and UTF-32 representation is trivial: the value is the same.
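To make the "value is the same" point concrete, here is a rough sketch that prints the four UTF-32BE bytes of a code point as \xHH escapes. The function name utf32be_escapes is hypothetical; all it does is zero-pad the code point to 32 bits and split it into bytes:

```shell
# Hypothetical helper: hex code point -> its UTF-32BE bytes written as
# \xHH escapes (most significant byte first).
utf32be_escapes() (
    cp=$(( 0x$1 ))
    printf '\\x%02X\\x%02X\\x%02X\\x%02X\n' \
        $(( (cp >> 24) & 0xFF )) $(( (cp >> 16) & 0xFF )) \
        $(( (cp >> 8) & 0xFF )) $(( cp & 0xFF ))
)

utf32be_escapes 1F615    # prints \x00\x01\xF6\x15
```

The resulting string is the same kind of argument that the echo -ne call above pipes into iconv.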
answered Dec 30 '15 at 7:16
Mat
Nice way to do it in your head or on paper:
- Figure out how many bytes it will be: values under U+0080 are one byte, else under U+0800 are 2 bytes, else under U+10000 are 3 bytes, else 4 bytes. In your case, 4 bytes.
- Convert hex to octal: 0373025.
- Starting at the end, peel off 2 octal digits at a time to get a sequence of octal values: 037 030 025.
- If you have fewer octal values than the expected number of bytes, add an extra 0 at the beginning: 000 037 030 025.
- For all but the first, add on 0200 to get: 000 0237 0230 0225.
- For the first, add 0300 if the expected length is 2, 0340 if it's 3, or 0360 if it's 4, to get: 0360 0237 0230 0225.
Now write as a string of octal escapes: \360\237\230\225. Optionally convert back to hex if you want.
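The paper method above translates fairly directly into shell arithmetic. The following is only a sketch of it (the function name utf8_octal_escapes is made up, and it does not reject surrogates or values above U+10FFFF); peeling off two octal digits is the same as taking the low 6 bits:

```shell
# Hypothetical helper: hex code point -> UTF-8 bytes as octal escapes.
utf8_octal_escapes() (
    cp=$(( 0x$1 ))

    # Step 1: how many bytes will the encoding need?
    if   [ "$cp" -lt $(( 0x80 )) ];    then n=1
    elif [ "$cp" -lt $(( 0x800 )) ];   then n=2
    elif [ "$cp" -lt $(( 0x10000 )) ]; then n=3
    else                                    n=4
    fi

    # Peel off 2 octal digits (6 bits) at a time from the end and add
    # 0200 to each; this yields the n-1 trailing bytes.
    out=
    i=1
    while [ "$i" -lt "$n" ]; do
        out=$(printf '\\%03o' $(( 0200 | (cp & 077) )))$out
        cp=$(( cp >> 6 ))
        i=$(( i + 1 ))
    done

    # Whatever is left goes into the first byte, plus a length marker.
    case $n in
        1) lead=0 ;;
        2) lead=$(( 0300 )) ;;
        3) lead=$(( 0340 )) ;;
        4) lead=$(( 0360 )) ;;
    esac
    printf '\\%03o%s\n' $(( lead | cp )) "$out"
)

utf8_octal_escapes 1F615    # prints \360\237\230\225
```

Since octal escapes in a printf format are standard, printf "$(utf8_octal_escapes 1F615)\n" should emit the actual bytes regardless of the locale.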
edited Dec 30 '15 at 21:33
answered Dec 30 '15 at 15:53
R..