
素敵なおひげですね

A blog I write as the mood takes me, mostly about PowerShell.

Handling Shift-JIS in PowerShell on Linux (Mac)

PowerShell Linux Mac

Just the other day PowerShell was open-sourced and made cross-platform, and I could not be happier about it.

github.com

I have quite a few thoughts on this, but there is so much information that I have not yet managed to sort them out.
Once I have, I plan to write them up on this blog.

Handling Shift-JIS in PowerShell on Linux (Mac)

I am currently playing around with the CentOS and macOS builds of PowerShell. Being in Japan, I expect many people will want to work with Shift-JIS, so I decided to write at least this entry right away.

Character encodings available in .NET Core

PowerShell on Linux (and Mac) is an application that runs on .NET Core.
The character encodings it can handle are therefore the same as those .NET Core can handle.

For details on the encodings available in .NET Core, see the following article on @ishisaka's blog.

opcdiary.net

As that article explains, .NET Core does not support Shift-JIS (MS932) out of the box; you have to call System.Text.Encoding.RegisterProvider first.

The background is covered in detail in the following article, although it is written for UWP.

www.atmarkit.co.jp

Handling Shift-JIS in PowerShell on Linux (Mac)

For this reason, to handle Shift-JIS in PowerShell on Linux (Mac) you need to run the following code.

# Call the System.Text.Encoding.RegisterProvider method to make Shift-JIS available.
[System.Text.Encoding]::RegisterProvider([System.Text.CodePagesEncodingProvider]::Instance)

This is simply the C# code translated directly into PowerShell, so it should need no particular explanation.
Think of it as an incantation and just run it.
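Once the provider is registered, Shift-JIS can be used like any other encoding. The following is a minimal sketch of a round trip through a Shift-JIS file. The sample text and the temporary file name `sjis-sample.txt` are assumptions made up for illustration, and the script file itself is assumed to be saved as UTF-8.

```powershell
# Register the code pages provider (the same call as above).
[System.Text.Encoding]::RegisterProvider([System.Text.CodePagesEncodingProvider]::Instance)

# Code page 932 is Shift-JIS.
$sjis = [System.Text.Encoding]::GetEncoding(932)

# Hypothetical sample text and temporary file path, for illustration only.
$text = '日本語'
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sjis-sample.txt'

# Encode in memory: each of these three characters takes 2 bytes in Shift-JIS.
$bytes = $sjis.GetBytes($text)

# Write the text as Shift-JIS, then read it back with the same encoding.
[System.IO.File]::WriteAllText($path, $text, $sjis)
$readBack = [System.IO.File]::ReadAllText($path, $sjis)

# True when the text survives the round trip unchanged.
$readBack -eq $text

# Clean up the temporary file.
Remove-Item -Path $path
```

Going through the .NET `System.IO.File` methods keeps the example independent of whatever encodings the `-Encoding` parameter of the file cmdlets accepts in a given PowerShell build.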

[Supplement] List of character encodings available in PowerShell on Linux (Mac)

As a supplement, let's run the following code to list the character encodings usable in PowerShell on Linux (Mac).
We compare the results with the System.Text.Encoding.RegisterProvider call commented out and with it enabled.

Test-Encodings.ps1

#! /opt/microsoft/powershell/6.0.0-alpha.9/powershell

# Include the incantation, or leave it commented out
# [System.Text.Encoding]::RegisterProvider([System.Text.CodePagesEncodingProvider]::Instance)

for ($i = 0; $i -lt 65535; $i++){
    try{
        $enc = [System.Text.Encoding]::GetEncoding($i)
        Write-Output ("{0}, {1}, {2}" -f $i, $enc.WebName, $enc.EncodingName)
    }
    catch{}
}
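If you only care about one code page, a narrower check in a single session may be enough. The sketch below tests code page 932 before and after registering the provider. `Test-CodePage` is a hypothetical helper name introduced here, not something from the script above. Note also that if the host has already registered the provider, the "before" check may report True as well.

```powershell
# Hypothetical helper: returns $true if GetEncoding succeeds for the code page.
function Test-CodePage {
    param([int]$CodePage)
    try {
        [void][System.Text.Encoding]::GetEncoding($CodePage)
        return $true
    }
    catch {
        # GetEncoding throws when the code page is not supported.
        return $false
    }
}

# Check Shift-JIS (code page 932) before registering the provider.
$before = Test-CodePage -CodePage 932

# The incantation.
[System.Text.Encoding]::RegisterProvider([System.Text.CodePagesEncodingProvider]::Instance)

# Check again after registering the provider.
$after = Test-CodePage -CodePage 932

"932 before: {0}, after: {1}" -f $before, $after
```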

On Linux (CentOS 7.1)

Without the incantation (the default), the result is as follows.

# CentOS 7.1
PS /home/vagrant> ./Test-Encodings.ps1
0, utf-8, Unicode (UTF-8)
1200, utf-16, Unicode
1201, utf-16BE, Unicode (Big-Endian)
12000, utf-32, Unicode (UTF-32)
12001, utf-32BE, Unicode (UTF-32 Big-Endian)
20127, us-ascii, US-ASCII
28591, iso-8859-1, Western European (ISO)
65000, utf-7, Unicode (UTF-7)
65001, utf-8, Unicode (UTF-8)

With the incantation, the result is as follows.

# CentOS 7.1
PS /home/vagrant> ./Test-Encodings.ps1
0, utf-8, Unicode (UTF-8)
37, ibm037, IBM EBCDIC (US-Canada)
437, ibm437, OEM United States
500, ibm500, IBM EBCDIC (International)
708, asmo-708, Arabic (ASMO 708)
720, dos-720, Arabic (DOS)
737, ibm737, Greek (DOS)
775, ibm775, Baltic (DOS)
850, ibm850, Western European (DOS)
852, ibm852, Central European (DOS)
855, ibm855, OEM Cyrillic
857, ibm857, Turkish (DOS)
858, ibm00858, OEM Multilingual Latin I
860, ibm860, Portuguese (DOS)
861, ibm861, Icelandic (DOS)
862, dos-862, Hebrew (DOS)
863, ibm863, French Canadian (DOS)
864, ibm864, Arabic (864)
865, ibm865, Nordic (DOS)
866, cp866, Cyrillic (DOS)
869, ibm869, Greek, Modern (DOS)
870, ibm870, IBM EBCDIC (Multilingual Latin-2)
874, windows-874, Thai (Windows)
875, cp875, IBM EBCDIC (Greek Modern)
932, shift_jis, Japanese (Shift-JIS)
936, gb2312, Chinese Simplified (GB2312)
949, ks_c_5601-1987, Korean
950, big5, Chinese Traditional (Big5)
1026, ibm1026, IBM EBCDIC (Turkish Latin-5)
1047, ibm01047, IBM Latin-1
1140, ibm01140, IBM EBCDIC (US-Canada-Euro)
1141, ibm01141, IBM EBCDIC (Germany-Euro)
1142, ibm01142, IBM EBCDIC (Denmark-Norway-Euro)
1143, ibm01143, IBM EBCDIC (Finland-Sweden-Euro)
1144, ibm01144, IBM EBCDIC (Italy-Euro)
1145, ibm01145, IBM EBCDIC (Spain-Euro)
1146, ibm01146, IBM EBCDIC (UK-Euro)
1147, ibm01147, IBM EBCDIC (France-Euro)
1148, ibm01148, IBM EBCDIC (International-Euro)
1149, ibm01149, IBM EBCDIC (Icelandic-Euro)
1200, utf-16, Unicode
1201, utf-16BE, Unicode (Big-Endian)
1250, windows-1250, Central European (Windows)
1251, windows-1251, Cyrillic (Windows)
1252, windows-1252, Western European (Windows)
1253, windows-1253, Greek (Windows)
1254, windows-1254, Turkish (Windows)
1255, windows-1255, Hebrew (Windows)
1256, windows-1256, Arabic (Windows)
1257, windows-1257, Baltic (Windows)
1258, windows-1258, Vietnamese (Windows)
1361, johab, Korean (Johab)
10000, macintosh, Western European (Mac)
10001, x-mac-japanese, Japanese (Mac)
10002, x-mac-chinesetrad, Chinese Traditional (Mac)
10003, x-mac-korean, Korean (Mac)
10004, x-mac-arabic, Arabic (Mac)
10005, x-mac-hebrew, Hebrew (Mac)
10006, x-mac-greek, Greek (Mac)
10007, x-mac-cyrillic, Cyrillic (Mac)
10008, x-mac-chinesesimp, Chinese Simplified (Mac)
10010, x-mac-romanian, Romanian (Mac)
10017, x-mac-ukrainian, Ukrainian (Mac)
10021, x-mac-thai, Thai (Mac)
10029, x-mac-ce, Central European (Mac)
10079, x-mac-icelandic, Icelandic (Mac)
10081, x-mac-turkish, Turkish (Mac)
10082, x-mac-croatian, Croatian (Mac)
12000, utf-32, Unicode (UTF-32)
12001, utf-32BE, Unicode (UTF-32 Big-Endian)
20000, x-chinese-cns, Chinese Traditional (CNS)
20001, x-cp20001, TCA Taiwan
20002, x-chinese-eten, Chinese Traditional (Eten)
20003, x-cp20003, IBM5550 Taiwan
20004, x-cp20004, TeleText Taiwan
20005, x-cp20005, Wang Taiwan
20105, x-ia5, Western European (IA5)
20106, x-ia5-german, German (IA5)
20107, x-ia5-swedish, Swedish (IA5)
20108, x-ia5-norwegian, Norwegian (IA5)
20127, us-ascii, US-ASCII
20261, x-cp20261, T.61
20269, x-cp20269, ISO-6937
20273, ibm273, IBM EBCDIC (Germany)
20277, ibm277, IBM EBCDIC (Denmark-Norway)
20278, ibm278, IBM EBCDIC (Finland-Sweden)
20280, ibm280, IBM EBCDIC (Italy)
20284, ibm284, IBM EBCDIC (Spain)
20285, ibm285, IBM EBCDIC (UK)
20290, ibm290, IBM EBCDIC (Japanese katakana)
20297, ibm297, IBM EBCDIC (France)
20420, ibm420, IBM EBCDIC (Arabic)
20423, ibm423, IBM EBCDIC (Greek)
20424, ibm424, IBM EBCDIC (Hebrew)
20833, x-ebcdic-koreanextended, IBM EBCDIC (Korean Extended)
20838, ibm-thai, IBM EBCDIC (Thai)
20866, koi8-r, Cyrillic (KOI8-R)
20871, ibm871, IBM EBCDIC (Icelandic)
20880, ibm880, IBM EBCDIC (Cyrillic Russian)
20905, ibm905, IBM EBCDIC (Turkish)
20924, ibm00924, IBM Latin-1
20932, euc-jp, Japanese (JIS 0208-1990 and 0212-1990)
20936, x-cp20936, Chinese Simplified (GB2312-80)
20949, x-cp20949, Korean Wansung
21025, cp1025, IBM EBCDIC (Cyrillic Serbian-Bulgarian)
21866, koi8-u, Cyrillic (KOI8-U)
28591, iso-8859-1, Western European (ISO)
28592, iso-8859-2, Central European (ISO)
28593, iso-8859-3, Latin 3 (ISO)
28594, iso-8859-4, Baltic (ISO)
28595, iso-8859-5, Cyrillic (ISO)
28596, iso-8859-6, Arabic (ISO)
28597, iso-8859-7, Greek (ISO)
28598, iso-8859-8, Hebrew (ISO-Visual)
28599, iso-8859-9, Turkish (ISO)
28603, iso-8859-13, Estonian (ISO)
28605, iso-8859-15, Latin 9 (ISO)
29001, x-europa, Europa
38598, iso-8859-8-i, Hebrew (ISO-Logical)
50220, iso-2022-jp, Japanese (JIS)
50221, csiso2022jp, Japanese (JIS-Allow 1 byte Kana)
50222, iso-2022-jp, Japanese (JIS-Allow 1 byte Kana - SO/SI)
50225, iso-2022-kr, Korean (ISO)
50227, x-cp50227, Chinese Simplified (ISO-2022)
51932, euc-jp, Japanese (EUC)
51936, euc-cn, Chinese Simplified (EUC)
51949, euc-kr, Korean (EUC)
52936, hz-gb-2312, Chinese Simplified (HZ)
54936, gb18030, Chinese Simplified (GB18030)
57002, x-iscii-de, ISCII Devanagari
57003, x-iscii-be, ISCII Bengali
57004, x-iscii-ta, ISCII Tamil
57005, x-iscii-te, ISCII Telugu
57006, x-iscii-as, ISCII Assamese
57007, x-iscii-or, ISCII Oriya
57008, x-iscii-ka, ISCII Kannada
57009, x-iscii-ma, ISCII Malayalam
57010, x-iscii-gu, ISCII Gujarati
57011, x-iscii-pa, ISCII Punjabi
65000, utf-7, Unicode (UTF-7)
65001, utf-8, Unicode (UTF-8)

On macOS

On a Mac, change the shebang to

#! /usr/local/bin/powershell

before running the script.
Without the incantation (the default), the result is as follows.

# OS X 10.11.6
PS /Users/stknohg> ./Test-Encoding.ps1                                            
0, utf-8, Unicode (UTF-8)
1200, utf-16, Unicode
1201, utf-16BE, Unicode (Big-Endian)
12000, utf-32, Unicode (UTF-32)
12001, utf-32BE, Unicode (UTF-32 Big-Endian)
20127, us-ascii, US-ASCII
28591, iso-8859-1, Western European (ISO)
65000, utf-7, Unicode (UTF-7)
65001, utf-8, Unicode (UTF-8)

With the incantation, the result is as follows.

# OS X 10.11.6
PS /Users/stknohg> ./Test-Encoding.ps1                                            
0, utf-8, Unicode (UTF-8)
37, ibm037, IBM EBCDIC (US-Canada)
437, ibm437, OEM United States
500, ibm500, IBM EBCDIC (International)
708, asmo-708, Arabic (ASMO 708)
720, dos-720, Arabic (DOS)
737, ibm737, Greek (DOS)
775, ibm775, Baltic (DOS)
850, ibm850, Western European (DOS)
852, ibm852, Central European (DOS)
855, ibm855, OEM Cyrillic
857, ibm857, Turkish (DOS)
858, ibm00858, OEM Multilingual Latin I
860, ibm860, Portuguese (DOS)
861, ibm861, Icelandic (DOS)
862, dos-862, Hebrew (DOS)
863, ibm863, French Canadian (DOS)
864, ibm864, Arabic (864)
865, ibm865, Nordic (DOS)
866, cp866, Cyrillic (DOS)
869, ibm869, Greek, Modern (DOS)
870, ibm870, IBM EBCDIC (Multilingual Latin-2)
874, windows-874, Thai (Windows)
875, cp875, IBM EBCDIC (Greek Modern)
932, shift_jis, Japanese (Shift-JIS)
936, gb2312, Chinese Simplified (GB2312)
949, ks_c_5601-1987, Korean
950, big5, Chinese Traditional (Big5)
1026, ibm1026, IBM EBCDIC (Turkish Latin-5)
1047, ibm01047, IBM Latin-1
1140, ibm01140, IBM EBCDIC (US-Canada-Euro)
1141, ibm01141, IBM EBCDIC (Germany-Euro)
1142, ibm01142, IBM EBCDIC (Denmark-Norway-Euro)
1143, ibm01143, IBM EBCDIC (Finland-Sweden-Euro)
1144, ibm01144, IBM EBCDIC (Italy-Euro)
1145, ibm01145, IBM EBCDIC (Spain-Euro)
1146, ibm01146, IBM EBCDIC (UK-Euro)
1147, ibm01147, IBM EBCDIC (France-Euro)
1148, ibm01148, IBM EBCDIC (International-Euro)
1149, ibm01149, IBM EBCDIC (Icelandic-Euro)
1200, utf-16, Unicode
1201, utf-16BE, Unicode (Big-Endian)
1250, windows-1250, Central European (Windows)
1251, windows-1251, Cyrillic (Windows)
1252, windows-1252, Western European (Windows)
1253, windows-1253, Greek (Windows)
1254, windows-1254, Turkish (Windows)
1255, windows-1255, Hebrew (Windows)
1256, windows-1256, Arabic (Windows)
1257, windows-1257, Baltic (Windows)
1258, windows-1258, Vietnamese (Windows)
1361, johab, Korean (Johab)
10000, macintosh, Western European (Mac)
10001, x-mac-japanese, Japanese (Mac)
10002, x-mac-chinesetrad, Chinese Traditional (Mac)
10003, x-mac-korean, Korean (Mac)
10004, x-mac-arabic, Arabic (Mac)
10005, x-mac-hebrew, Hebrew (Mac)
10006, x-mac-greek, Greek (Mac)
10007, x-mac-cyrillic, Cyrillic (Mac)
10008, x-mac-chinesesimp, Chinese Simplified (Mac)
10010, x-mac-romanian, Romanian (Mac)
10017, x-mac-ukrainian, Ukrainian (Mac)
10021, x-mac-thai, Thai (Mac)
10029, x-mac-ce, Central European (Mac)
10079, x-mac-icelandic, Icelandic (Mac)
10081, x-mac-turkish, Turkish (Mac)
10082, x-mac-croatian, Croatian (Mac)
12000, utf-32, Unicode (UTF-32)
12001, utf-32BE, Unicode (UTF-32 Big-Endian)
20000, x-chinese-cns, Chinese Traditional (CNS)
20001, x-cp20001, TCA Taiwan
20002, x-chinese-eten, Chinese Traditional (Eten)
20003, x-cp20003, IBM5550 Taiwan
20004, x-cp20004, TeleText Taiwan
20005, x-cp20005, Wang Taiwan
20105, x-ia5, Western European (IA5)
20106, x-ia5-german, German (IA5)
20107, x-ia5-swedish, Swedish (IA5)
20108, x-ia5-norwegian, Norwegian (IA5)
20127, us-ascii, US-ASCII
20261, x-cp20261, T.61
20269, x-cp20269, ISO-6937
20273, ibm273, IBM EBCDIC (Germany)
20277, ibm277, IBM EBCDIC (Denmark-Norway)
20278, ibm278, IBM EBCDIC (Finland-Sweden)
20280, ibm280, IBM EBCDIC (Italy)
20284, ibm284, IBM EBCDIC (Spain)
20285, ibm285, IBM EBCDIC (UK)
20290, ibm290, IBM EBCDIC (Japanese katakana)
20297, ibm297, IBM EBCDIC (France)
20420, ibm420, IBM EBCDIC (Arabic)
20423, ibm423, IBM EBCDIC (Greek)
20424, ibm424, IBM EBCDIC (Hebrew)
20833, x-ebcdic-koreanextended, IBM EBCDIC (Korean Extended)
20838, ibm-thai, IBM EBCDIC (Thai)
20866, koi8-r, Cyrillic (KOI8-R)
20871, ibm871, IBM EBCDIC (Icelandic)
20880, ibm880, IBM EBCDIC (Cyrillic Russian)
20905, ibm905, IBM EBCDIC (Turkish)
20924, ibm00924, IBM Latin-1
20932, euc-jp, Japanese (JIS 0208-1990 and 0212-1990)
20936, x-cp20936, Chinese Simplified (GB2312-80)
20949, x-cp20949, Korean Wansung
21025, cp1025, IBM EBCDIC (Cyrillic Serbian-Bulgarian)
21866, koi8-u, Cyrillic (KOI8-U)
28591, iso-8859-1, Western European (ISO)
28592, iso-8859-2, Central European (ISO)
28593, iso-8859-3, Latin 3 (ISO)
28594, iso-8859-4, Baltic (ISO)
28595, iso-8859-5, Cyrillic (ISO)
28596, iso-8859-6, Arabic (ISO)
28597, iso-8859-7, Greek (ISO)
28598, iso-8859-8, Hebrew (ISO-Visual)
28599, iso-8859-9, Turkish (ISO)
28603, iso-8859-13, Estonian (ISO)
28605, iso-8859-15, Latin 9 (ISO)
29001, x-europa, Europa
38598, iso-8859-8-i, Hebrew (ISO-Logical)
50220, iso-2022-jp, Japanese (JIS)
50221, csiso2022jp, Japanese (JIS-Allow 1 byte Kana)
50222, iso-2022-jp, Japanese (JIS-Allow 1 byte Kana - SO/SI)
50225, iso-2022-kr, Korean (ISO)
50227, x-cp50227, Chinese Simplified (ISO-2022)
51932, euc-jp, Japanese (EUC)
51936, euc-cn, Chinese Simplified (EUC)
51949, euc-kr, Korean (EUC)
52936, hz-gb-2312, Chinese Simplified (HZ)
54936, gb18030, Chinese Simplified (GB18030)
57002, x-iscii-de, ISCII Devanagari
57003, x-iscii-be, ISCII Bengali
57004, x-iscii-ta, ISCII Tamil
57005, x-iscii-te, ISCII Telugu
57006, x-iscii-as, ISCII Assamese
57007, x-iscii-or, ISCII Oriya
57008, x-iscii-ka, ISCII Kannada
57009, x-iscii-ma, ISCII Malayalam
57010, x-iscii-gu, ISCII Gujarati
57011, x-iscii-pa, ISCII Punjabi
65000, utf-7, Unicode (UTF-7)
65001, utf-8, Unicode (UTF-8)