A Lua implementation for checking whether a UTF-8 encoded string is Chinese, Korean, or Japanese

xiaoxiao  2021-02-28

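First, split the string into individual characters.

-- Split a UTF-8 string into a list of single characters
function stringToChars(str)
    -- This relies on how UTF-8 encodes Unicode: the leading bits of a
    -- character's first byte tell how many bytes the character occupies.
    -- UTF-8 is a variable-length encoding:
    --   * For a single-byte symbol, the first bit is 0 and the remaining 7 bits
    --     hold the symbol's Unicode code point, so for English letters UTF-8
    --     is identical to ASCII.
    --   * For an n-byte symbol (n > 1), the first n bits of the first byte are 1,
    --     bit n+1 is 0, and every following byte starts with 10.
    --   * All remaining bits hold the symbol's Unicode code point.
    local list = {}
    local len = string.len(str)
    local i = 1
    while i <= len do
        local c = string.byte(str, i)
        -- Default of 1 also covers NUL and invalid lead bytes,
        -- so the loop always advances.
        local shift = 1
        if c > 0 and c <= 127 then
            shift = 1        -- 0xxxxxxx
        elseif (c >= 192 and c <= 223) then
            shift = 2        -- 110xxxxx
        elseif (c >= 224 and c <= 239) then
            shift = 3        -- 1110xxxx
        elseif (c >= 240 and c <= 247) then
            shift = 4        -- 11110xxx
        end
        local char = string.sub(str, i, i + shift - 1)
        i = i + shift
        table.insert(list, char)
    end
    -- len is the length of str in bytes, not in characters
    return list, len
end

Then check whether a single character falls in a CJK range.

function isCJKCode(char)
    -- Pack the character's UTF-8 bytes into one integer, most significant byte first.
    local len = string.len(char)
    local chInt = 0
    for i = 1, len do
        local n = string.byte(char, i)
        chInt = chInt * 256 + n
    end
    -- The decimal bounds below are the UTF-8 encodings of these code point
    -- ranges, packed the same way as chInt. All of them are 3-byte characters.
    -- (0x2E80 - 0x2FDF) CJK Radicals Supplement & Kangxi Radicals
    -- (0x2FF0 - 0x30FF) Ideographic Description Characters, CJK Symbols and Punctuation & Japanese kana
    -- (0x3100 - 0x31BF) Bopomofo, Hangul Compatibility Jamo (Korean), Kanbun, Bopomofo Extended
    -- (0x31C0 - 0x4DFF) Other extensions
    -- (0x4E00 - 0x9FBF) CJK Unified Ideographs
    -- (0xAC00 - 0xD7AF) Hangul Syllables
    -- (0xF900 - 0xFAFF) CJK Compatibility Ideographs
    -- (0xFE30 - 0xFE4F) CJK Compatibility Forms
    return (chInt >= 14858880 and chInt <= 14860191)
        or (chInt >= 14860208 and chInt <= 14910399)
        or (chInt >= 14910592 and chInt <= 14911167)
        or (chInt >= 14911360 and chInt <= 14989247)
        or (chInt >= 14989440 and chInt <= 15318719)
        or (chInt >= 15380608 and chInt <= 15572655)
        or (chInt >= 15705216 and chInt <= 15707071)
        or (chInt >= 15710384 and chInt <= 15710607)
end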
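The decimal constants in isCJKCode can look arbitrary. They appear to be the UTF-8 byte sequences of the range endpoints, read as one big-endian integer. The sketch below shows one way to derive them. The helper names utf8IntFromCodepoint and utf8IntFromChar are hypothetical and not part of the original post, and the first helper only handles 3-byte code points (0x0800 to 0xFFFF), which covers every range listed above.

```lua
-- Hypothetical helper (not from the original post): encode a 3-byte code point
-- as UTF-8 and pack the three bytes into one integer, most significant first.
function utf8IntFromCodepoint(cp)
    assert(cp >= 0x0800 and cp <= 0xFFFF, "only 3-byte code points are handled")
    local b1 = 224 + math.floor(cp / 4096)       -- 1110xxxx: top 4 bits
    local b2 = 128 + math.floor(cp / 64) % 64    -- 10xxxxxx: middle 6 bits
    local b3 = 128 + cp % 64                     -- 10xxxxxx: low 6 bits
    return (b1 * 256 + b2) * 256 + b3
end

-- Hypothetical helper: the same byte packing that isCJKCode performs,
-- applied to an actual character string.
function utf8IntFromChar(char)
    local n = 0
    for i = 1, #char do
        n = n * 256 + string.byte(char, i)
    end
    return n
end

-- Endpoints of the CJK Unified Ideographs range used in isCJKCode
print(utf8IntFromCodepoint(0x4E00))  --> 14989440
print(utf8IntFromCodepoint(0x9FBF))  --> 15318719

-- U+4E2D is encoded as the bytes E4 B8 AD (228, 184, 173 in decimal)
local zhong = "\228\184\173"
local packed = utf8IntFromChar(zhong)
print(packed >= 14989440 and packed <= 15318719)  --> true
```

Comparing packed bytes works here because, among characters of the same byte length, UTF-8 byte order matches code point order. One- and two-byte characters pack to smaller integers than any bound in the list, and four-byte characters pack to larger ones, so neither can match by accident.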
When reposting, please credit the original URL: https://www.6miu.com/read-32639.html
