[Swift] Strings and Characters


[Swift] Strings and Characters > Unicode

eungding 2021. 12. 12. 22:09


This post covers the Unicode-related material in the Strings and Characters chapter of the Swift docs. (I have rearranged the order slightly.)

# Unicode

Swift's String and Character types are fully Unicode-compliant.

They also provide properties for accessing three Unicode representations (string encodings) of a string.

1. UTF-8 Representation

- A collection of UTF-8 code units

- Accessed through the utf8 property. The property's type is UTF8View (= a collection of unsigned 8-bit (UInt8) values)

2. UTF-16 Representation

- A collection of UTF-16 code units

- Accessed through the utf16 property. The property's type is UTF16View (= a collection of unsigned 16-bit (UInt16) values)

3. Unicode Scalar Representation (UTF-32 Representation)

- A collection of 21-bit Unicode scalar values, equivalent to the string’s UTF-32 encoding form

- Accessed through the unicodeScalars property. The property's type is UnicodeScalarView (= a collection of values of type UnicodeScalar)

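(I wondered why this is described as 21 bits rather than 32 bits even though it is UTF-32. Wikipedia confirms that only 21 bits are actually needed, because the largest code point is U+10FFFF.)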
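The 21-bit figure can be checked with a little arithmetic. The snippet below is a minimal sketch; the variable names are my own and not from the Swift docs. It counts the bits needed for U+10FFFF and shows that the failable Unicode.Scalar initializer rejects values outside the valid range.

```swift
// The largest Unicode code point is U+10FFFF.
let maxScalarValue: UInt32 = 0x10FFFF

// Bits needed to represent that value: 32 minus the leading zero bits.
let bitsNeeded = UInt32.bitWidth - maxScalarValue.leadingZeroBitCount
print(bitsNeeded) // 21

// Unicode.Scalar's UInt32 initializer is failable. It returns nil for
// values above U+10FFFF and for the surrogate range U+D800...U+DFFF.
let tooLarge = Unicode.Scalar(0x110000 as UInt32)
let surrogate = Unicode.Scalar(0xD800 as UInt32)
print(tooLarge == nil, surrogate == nil) // true true
```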

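Character and String are built on Unicode scalar values (a Character is an extended grapheme cluster made of one or more scalars),

while NSString is based on the UTF-16 representation.

As a result, String's count (which counts Characters) and NSString's length (which counts UTF-16 code units) are not always equal. Keep this in mind during development!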

import Foundation   // needed for NSString

let dog: String = "🐶"
print(dog.utf8.count) // 4
print(dog.utf16.count) // 2
print(dog.unicodeScalars.count) // 1

print(dog.count) // 1
print(NSString(string: dog).length) // 2
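The gap between the views grows when one Character is made of several scalars. As a rough illustration (the flag example is my own choice, not from the original post), a flag emoji consists of two regional indicator scalars:

```swift
import Foundation

// A flag emoji is two regional indicator scalars forming one Character.
let flag = "\u{1F1F0}\u{1F1F7}" // 🇰🇷

print(flag.count)                     // 1 (Characters)
print(flag.unicodeScalars.count)      // 2
print(flag.utf16.count)               // 4
print(flag.utf8.count)                // 8
print(NSString(string: flag).length)  // 4, same as utf16.count
```

NSString's length always matches utf16.count, so when converting an index or range between the two types it is safer to go through the UTF-16 view than through count.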

# Unicode Representations of Strings

Let's look at each representation in more detail below.

1. UTF-8 Representation

You can access the UTF-8 representation of a string by iterating over its utf8 property. Each code unit is printed as a decimal value.

For reference, "position" here means the order in which the code units are printed.

let dogString = "Dog‼🐶"
for codeUnit in dogString.utf8 {
    print("\(codeUnit) ", terminator: "")
}

// Prints "68 111 103 226 128 188 240 159 144 182"
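To see which code units belong to which character, here is a small sketch (the names `sample`, `widths`, and `bytes` are assumptions for illustration). It counts the UTF-8 code units per Character and decodes the printed values back into the string.

```swift
let sample = "Dog\u{203C}\u{1F436}" // "Dog‼🐶"

// Number of UTF-8 code units contributed by each Character.
let widths = sample.map { String($0).utf8.count }
print(widths) // [1, 1, 1, 3, 4]

// Decoding the printed code units gives back the original string.
let bytes: [UInt8] = [68, 111, 103, 226, 128, 188, 240, 159, 144, 182]
let decoded = String(decoding: bytes, as: UTF8.self)
print(decoded == sample) // true
```

So 68 111 103 are D, o, g; 226 128 188 encode ‼ (U+203C); and 240 159 144 182 encode 🐶 (U+1F436).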

2. UTF-16 Representation

You can access the UTF-16 representation of a string by iterating over its utf16 property.

let dogString = "Dog‼🐶"
for codeUnit in dogString.utf16 {
    print("\(codeUnit) ", terminator: "")
}
// Prints "68 111 103 8252 55357 56374"
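The last two values, 55357 and 56374, are a surrogate pair for 🐶 (U+1F436), which does not fit in a single 16-bit code unit. The sketch below applies the standard UTF-16 surrogate arithmetic by hand; the variable names are assumptions of mine.

```swift
let dogFace: UInt32 = 0x1F436

// Subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.
let offset = dogFace - 0x10000                       // 0xF436
let highSurrogate = UInt16(0xD800 + (offset >> 10))  // top 10 bits
let lowSurrogate = UInt16(0xDC00 + (offset & 0x3FF)) // bottom 10 bits
print(highSurrogate, lowSurrogate) // 55357 56374

// The hand-computed pair matches what the utf16 view reports.
let fromString = Array("\u{1F436}".utf16)
print(fromString == [highSurrogate, lowSurrogate]) // true
```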

3. Unicode Scalar Representation

You can access the Unicode scalar representation of a string by iterating over its unicodeScalars property.

Use the value property of each UnicodeScalar to print its numeric value.

let dogString = "Dog‼🐶"
for scalar in dogString.unicodeScalars {
    print("\(scalar.value) ", terminator: "")
}
// Prints "68 111 103 8252 128054"

The last value can be written as:

- decimal: 128054

- hexadecimal: 1F436

- Unicode scalar: U+1F436
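Converting between these three notations takes only a couple of standard library calls. A minimal sketch, with `value` and `hex` as illustrative names:

```swift
let value: UInt32 = 128054

// Decimal to hexadecimal.
let hex = String(value, radix: 16, uppercase: true)
print(hex)        // 1F436
print("U+\(hex)") // U+1F436

// Numeric value back to a Unicode.Scalar (the initializer is failable).
if let scalar = Unicode.Scalar(value) {
    print(scalar)                 // 🐶
    print(scalar == "\u{1F436}")  // true
}
```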

For reference, printing the scalar itself instead of its value property gives the following.

let dogString = "Dog‼🐶"
for scalar in dogString.unicodeScalars {
    print("\(scalar) ")
}
// D
// o
// g
// ‼
// 🐶

# Unicode Scalar Values

As mentioned above, Swift's native String type is built from Unicode scalar values.

A Unicode scalar value is a unique 21-bit number for a character or modifier.

For example:

- U+0061 // for LATIN SMALL LETTER A ("a")

- U+1F425 // for FRONT-FACING BABY CHICK ("🐥")

Not every 21-bit Unicode scalar value is assigned to a character.

Some values are reserved for future assignment, and the range U+D800 to U+DFFF is set aside for use in UTF-16 encoding (16-bit Unicode Transformation Format) as surrogates.

Scalar values that have been assigned to a character have a name,

such as LATIN SMALL LETTER A or FRONT-FACING BABY CHICK in the examples above.
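These names can also be looked up at run time. The sketch below assumes the `properties.name` API on Unicode.Scalar (available since Swift 5); the exact strings returned depend on the Unicode data shipped with the runtime.

```swift
let letterA: Unicode.Scalar = "\u{61}"
let babyChick: Unicode.Scalar = "\u{1F425}"

// properties.name is optional because unassigned scalars have no name.
print(letterA.properties.name ?? "unnamed")   // LATIN SMALL LETTER A
print(babyChick.properties.name ?? "unnamed") // FRONT-FACING BABY CHICK
```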

# Extended Grapheme Clusters

Every instance of Swift's Character type represents a single extended grapheme cluster.

(grapheme = the smallest meaningful contrastive unit in a writing system)

An extended grapheme cluster is a sequence of one or more Unicode scalars

that, when combined, produce a single human-readable character.

let eAcute: Character = "\u{E9}"                         // é
let combinedEAcute: Character = "\u{65}\u{301}"          // e followed by ́
// eAcute is é, combinedEAcute is é
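The two forms are different scalar sequences, yet Swift treats them as the same Character because equality is based on canonical equivalence. A short sketch (the names `single` and `combined` are mine):

```swift
let single: Character = "\u{E9}"          // é as one scalar
let combined: Character = "\u{65}\u{301}" // e + COMBINING ACUTE ACCENT

print(single == combined) // true

// The underlying representations still differ.
print(String(single).unicodeScalars.count)   // 1
print(String(combined).unicodeScalars.count) // 2
print(String(single).utf8.count, String(combined).utf8.count) // 2 3
```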

Extended grapheme clusters are a flexible way to represent many complex script characters as a single Character value.

For example, Hangul syllables can be represented as either a precomposed or a decomposed sequence.

Both representations qualify as a single Character value.

let precomposed: Character = "\u{D55C}"                  // 한
let decomposed: Character = "\u{1112}\u{1161}\u{11AB}"   // ᄒ, ᅡ, ᆫ
// precomposed is 한, decomposed is 한
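The relationship between the two forms is arithmetic. The sketch below uses the Hangul syllable composition formula from the Unicode Standard (not something the Swift docs cover), with variable names of my own choosing, to derive U+D55C from the three jamo.

```swift
let lead: UInt32 = 0x1112   // ᄒ (leading consonant)
let vowel: UInt32 = 0x1161  // ᅡ (vowel)
let tail: UInt32 = 0x11AB   // ᆫ (trailing consonant)

let leadIndex = lead - 0x1100    // 18
let vowelIndex = vowel - 0x1161  // 0
let tailIndex = tail - 0x11A7    // 4

// syllable = 0xAC00 + (leadIndex * 21 + vowelIndex) * 28 + tailIndex
let composed = 0xAC00 + (leadIndex * 21 + vowelIndex) * 28 + tailIndex
print(String(composed, radix: 16, uppercase: true)) // D55C

// Swift compares both spellings as equal Characters.
let syllable: Character = "\u{D55C}"
let jamoSequence: Character = "\u{1112}\u{1161}\u{11AB}"
print(syllable == jamoSequence) // true
```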

Extended grapheme clusters also allow

enclosing marks such as COMBINING ENCLOSING CIRCLE (U+20DD)

to enclose other Unicode scalars as part of a single Character value.

let enclosedEAcute: Character = "\u{E9}\u{20DD}"
// enclosedEAcute is é⃝

# Counting Characters

Because Swift uses extended grapheme clusters for Character values,

string concatenation and modification do not always affect a string's character count.

For example, take the four-character word cafe.

Appending COMBINING ACUTE ACCENT (U+0301) turns the fourth character of cafe into é, and the count is still 4.

var word = "cafe"
print("the number of characters in \(word) is \(word.count)")
// Prints "the number of characters in cafe is 4"

word += "\u{301}"    // COMBINING ACUTE ACCENT, U+0301

print("the number of characters in \(word) is \(word.count)")
// Prints "the number of characters in café is 4"
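Although count stays the same, the other views do grow. A minimal sketch comparing all four counts before and after the append (the names `text`, `before`, and `after` are illustrative):

```swift
var text = "cafe"
let before = (text.count, text.unicodeScalars.count, text.utf16.count, text.utf8.count)

text += "\u{301}" // COMBINING ACUTE ACCENT, U+0301

let after = (text.count, text.unicodeScalars.count, text.utf16.count, text.utf8.count)

print(before) // (4, 4, 4, 4)
print(after)  // (4, 5, 5, 6)

// The result also compares equal to the precomposed spelling.
print(text == "caf\u{E9}") // true
```

Because a Character can span a variable number of scalars, count has to walk the whole string to find grapheme boundaries, so it is worth avoiding repeated calls in a hot loop over long strings.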
