A list of all available normalization forms. See www.unicode.org/reports/tr15/tr15-29.html for more information about normalization.
The Unicode version that is supported by the implementation
Hangul character boundaries and properties
All the Unicode whitespace
The BOM (byte order mark) can also be seen as whitespace: it is a non-rendering character used to distinguish between little-endian and big-endian encodings. Byte order is not an issue in UTF-8, so the BOM must be ignored.
The default normalization used for operations that require normalization. It can be set to any of the normalizations in NORMALIZATION_FORMS.
Example:
ActiveSupport::Multibyte::Unicode.default_normalization_form = :c
Compose decomposed characters to the composed form.
# File lib/active_support/multibyte/unicode.rb, line 166
def compose_codepoints(codepoints)
  pos = 0
  eoa = codepoints.length - 1
  starter_pos = 0
  starter_char = codepoints[0]
  previous_combining_class = -1
  while pos < eoa
    pos += 1
    lindex = starter_char - HANGUL_LBASE
    # -- Hangul
    if 0 <= lindex and lindex < HANGUL_LCOUNT
      vindex = codepoints[starter_pos+1] - HANGUL_VBASE rescue vindex = -1
      if 0 <= vindex and vindex < HANGUL_VCOUNT
        tindex = codepoints[starter_pos+2] - HANGUL_TBASE rescue tindex = -1
        if 0 <= tindex and tindex < HANGUL_TCOUNT
          j = starter_pos + 2
          eoa -= 2
        else
          tindex = 0
          j = starter_pos + 1
          eoa -= 1
        end
        codepoints[starter_pos..j] = (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
      end
      starter_pos += 1
      starter_char = codepoints[starter_pos]
    # -- Other characters
    else
      current_char = codepoints[pos]
      current = database.codepoints[current_char]
      if current.combining_class > previous_combining_class
        if ref = database.composition_map[starter_char]
          composition = ref[current_char]
        else
          composition = nil
        end
        unless composition.nil?
          codepoints[starter_pos] = composition
          starter_char = composition
          codepoints.delete_at pos
          eoa -= 1
          pos -= 1
          previous_combining_class = -1
        else
          previous_combining_class = current.combining_class
        end
      else
        previous_combining_class = current.combining_class
      end
      if current.combining_class == 0
        starter_pos = pos
        starter_char = codepoints[pos]
      end
    end
  end
  codepoints
end
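The Hangul branch above is plain index arithmetic, which a small standalone sketch can make concrete. The module name, method name and constants below are illustrative assumptions; the numeric values are the ones given by the Unicode Standard's Hangul composition algorithm, which the HANGUL_* constants on this page presumably mirror.

```ruby
# Hypothetical helper, not part of ActiveSupport.
module HangulComposeSketch
  S_BASE  = 0xAC00 # first precomposed syllable
  L_BASE  = 0x1100 # first leading consonant
  V_BASE  = 0x1161 # first vowel
  T_BASE  = 0x11A7 # one before the first trailing consonant
  L_COUNT = 19
  V_COUNT = 21
  T_COUNT = 28

  module_function

  # Returns the precomposed syllable for L + V (+ optional T),
  # or nil when the inputs are not composable jamo.
  def compose(l, v, t = nil)
    lindex = l - L_BASE
    vindex = v - V_BASE
    tindex = t ? t - T_BASE : 0
    return nil unless (0...L_COUNT).cover?(lindex) &&
                      (0...V_COUNT).cover?(vindex) &&
                      (0...T_COUNT).cover?(tindex)
    S_BASE + (lindex * V_COUNT + vindex) * T_COUNT + tindex
  end
end

HangulComposeSketch.compose(0x1112, 0x1161, 0x11AB) # => 0xD55C
HangulComposeSketch.compose(0x1100, 0x1161)         # => 0xAC00
```

A non-jamo input such as `compose(0x41, 0x1161)` returns nil rather than a bogus syllable, which is the role the `-1` sentinels play in the listing above.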
Decompose composed characters to the decomposed form.
# File lib/active_support/multibyte/unicode.rb, line 145
def decompose_codepoints(type, codepoints)
  codepoints.inject([]) do |decomposed, cp|
    # if it's a hangul syllable starter character
    if HANGUL_SBASE <= cp and cp < HANGUL_SLAST
      sindex = cp - HANGUL_SBASE
      ncp = [] # new codepoints
      ncp << HANGUL_LBASE + sindex / HANGUL_NCOUNT
      ncp << HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT
      tindex = sindex % HANGUL_TCOUNT
      ncp << (HANGUL_TBASE + tindex) unless tindex == 0
      decomposed.concat ncp
    # if the codepoint is decomposable with the current decomposition type
    elsif (ncp = database.codepoints[cp].decomp_mapping) and (!database.codepoints[cp].decomp_type || type == :compatability)
      decomposed.concat decompose_codepoints(type, ncp.dup)
    else
      decomposed << cp
    end
  end
end
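The Hangul branch of the decomposition can be tried in isolation. The sketch below is an illustration under assumptions: the module and method names are hypothetical, and the constants are taken from the Unicode Standard rather than read from this library.

```ruby
# Hypothetical helper, not part of ActiveSupport.
module HangulDecomposeSketch
  S_BASE  = 0xAC00
  L_BASE  = 0x1100
  V_BASE  = 0x1161
  T_BASE  = 0x11A7
  T_COUNT = 28
  N_COUNT = 588   # V_COUNT * T_COUNT = 21 * 28
  S_COUNT = 11172 # L_COUNT * N_COUNT = 19 * 588

  module_function

  # Splits a precomposed syllable into its jamo; any other
  # codepoint is returned unchanged in a one-element array.
  def decompose(cp)
    sindex = cp - S_BASE
    return [cp] unless (0...S_COUNT).cover?(sindex)
    jamo = [L_BASE + sindex / N_COUNT,
            V_BASE + (sindex % N_COUNT) / T_COUNT]
    tindex = sindex % T_COUNT
    jamo << T_BASE + tindex unless tindex.zero?
    jamo
  end
end

HangulDecomposeSketch.decompose(0xD55C) # => [0x1112, 0x1161, 0x11AB]
HangulDecomposeSketch.decompose(0xAC00) # => [0x1100, 0x1161]
```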
Reverse operation of g_unpack.
Example:
Unicode.g_pack(Unicode.g_unpack('क्षि')) # => 'क्षि'
# File lib/active_support/multibyte/unicode.rb, line 124
def g_pack(unpacked)
  (unpacked.flatten).pack('U*')
end
Unpack the string at grapheme boundaries. Returns a list of character lists.
Example:
Unicode.g_unpack('क्षि') # => [[2325, 2381], [2359], [2367]]
Unicode.g_unpack('Café') # => [[67], [97], [102], [233]]
# File lib/active_support/multibyte/unicode.rb, line 90
def g_unpack(string)
  codepoints = u_unpack(string)
  unpacked = []
  pos = 0
  marker = 0
  eoc = codepoints.length
  while(pos < eoc)
    pos += 1
    previous = codepoints[pos-1]
    current = codepoints[pos]
    if (
        # CR X LF
        ( previous == database.boundary[:cr] and current == database.boundary[:lf] ) or
        # L X (L|V|LV|LVT)
        ( database.boundary[:l] === previous and in_char_class?(current, [:l,:v,:lv,:lvt]) ) or
        # (LV|V) X (V|T)
        ( in_char_class?(previous, [:lv,:v]) and in_char_class?(current, [:v,:t]) ) or
        # (LVT|T) X (T)
        ( in_char_class?(previous, [:lvt,:t]) and database.boundary[:t] === current ) or
        # X Extend
        (database.boundary[:extend] === current)
      )
    else
      unpacked << codepoints[marker..pos-1]
      marker = pos
    end
  end
  unpacked
end
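To see the "do not break" rules in action without the Unicode database, here is a heavily reduced sketch. It keeps only two of the rules (CR X LF and X Extend), and its Extend set is an assumption limited to the Combining Diacritical Marks block; the names `simple_g_unpack` and `simple_g_pack` are hypothetical.

```ruby
# Hypothetical, reduced boundary data; the real table is far larger.
CR     = 0x0D
LF     = 0x0A
EXTEND = 0x0300..0x036F # Combining Diacritical Marks only

# Groups codepoints into clusters, joining CR+LF and base+Extend.
def simple_g_unpack(string)
  clusters = []
  string.unpack('U*').each do |cp|
    previous = clusters.last && clusters.last.last
    if previous && ((previous == CR && cp == LF) || EXTEND === cp)
      clusters.last << cp # no boundary: extend the current cluster
    else
      clusters << [cp]    # boundary: start a new cluster
    end
  end
  clusters
end

# Reverse operation: flatten the clusters and re-encode as UTF-8.
def simple_g_pack(unpacked)
  unpacked.flatten.pack('U*')
end

simple_g_unpack("e\u0301a\r\n") # => [[101, 769], [97], [13, 10]]
```

Because this sketch omits the Hangul rules and most Extend characters, its output will differ from g_unpack for many inputs.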
Detect whether the codepoint is in one of the given character classes. Returns true when it is in any of the specified classes and false otherwise. Valid character classes are: :cr, :lf, :l, :v, :lv, :lvt and :t.
Primarily used by the grapheme cluster support.
# File lib/active_support/multibyte/unicode.rb, line 81
def in_char_class?(codepoint, classes)
  classes.detect { |c| database.boundary[c] === codepoint } ? true : false
end
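The lookup relies on `===`, which is equality for a single Integer and a membership test for a Range, so one table can hold both kinds of entry. The sketch below assumes a small hypothetical table covering only the modern Hangul jamo ranges; `BOUNDARY` and `char_class_member?` are illustrative names, not the library's.

```ruby
# Hypothetical, reduced boundary table.
BOUNDARY = {
  cr: 0x0D,
  lf: 0x0A,
  l:  0x1100..0x1112, # modern leading consonants
  v:  0x1161..0x1175, # modern vowels
  t:  0x11A8..0x11C2  # modern trailing consonants
}.freeze

# True when the codepoint matches any of the named classes.
def char_class_member?(codepoint, classes)
  classes.any? { |c| BOUNDARY[c] === codepoint }
end

char_class_member?(0x1100, [:l, :v])   # => true
char_class_member?(0x41, [:l, :v, :t]) # => false
```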
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
string - The string to perform normalization on.
form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is ActiveSupport::Multibyte::Unicode.default_normalization_form.
# File lib/active_support/multibyte/unicode.rb, line 282
def normalize(string, form=nil)
  form ||= @default_normalization_form
  # See http://www.unicode.org/reports/tr15, Table 1
  codepoints = u_unpack(string)
  case form
    when :d
      reorder_characters(decompose_codepoints(:canonical, codepoints))
    when :c
      compose_codepoints(reorder_characters(decompose_codepoints(:canonical, codepoints)))
    when :kd
      reorder_characters(decompose_codepoints(:compatability, codepoints))
    when :kc
      compose_codepoints(reorder_characters(decompose_codepoints(:compatability, codepoints)))
    else
      raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end.pack('U*')
end
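For a feel of what the four forms produce, the sketch below uses Ruby's built-in String#unicode_normalize as a stand-in. It is not how the method above is implemented; `BUILTIN_FORMS` and `normalize_with_builtin` are hypothetical names, and the mapping from this page's short form names to the built-in symbols is an assumption made for illustration.

```ruby
# Hypothetical mapping from the short names used here to the
# symbols accepted by String#unicode_normalize.
BUILTIN_FORMS = { c: :nfc, d: :nfd, kc: :nfkc, kd: :nfkd }.freeze

def normalize_with_builtin(string, form = :kc)
  builtin = BUILTIN_FORMS.fetch(form) do
    raise ArgumentError, "#{form} is not a valid normalization variant"
  end
  string.unicode_normalize(builtin)
end

normalize_with_builtin("e\u0301", :c) # => "\u00E9" (composed e-acute)
normalize_with_builtin("\u00E9", :d)  # => "e\u0301" (e + combining acute)
normalize_with_builtin("\uFB01")      # => "fi" (ligature folded by NFKC)
```

The last call shows why the compatibility forms suit database keys and validations: visually equivalent input such as the "fi" ligature collapses to plain letters.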
Re-order codepoints so the string becomes canonical.
# File lib/active_support/multibyte/unicode.rb, line 129
def reorder_characters(codepoints)
  length = codepoints.length - 1
  pos = 0
  while pos < length do
    cp1, cp2 = database.codepoints[codepoints[pos]], database.codepoints[codepoints[pos+1]]
    if (cp1.combining_class > cp2.combining_class) && (cp2.combining_class > 0)
      codepoints[pos..pos+1] = cp2.code, cp1.code
      # Step back after a swap so the moved mark is compared with its new neighbour
      pos += (pos > 0 ? -1 : 1)
    else
      pos += 1
    end
  end
  codepoints
end
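Canonical reordering is effectively a stable bubble sort of adjacent combining marks by combining class. The sketch below reproduces that idea with a hypothetical three-entry class table (the class values for U+0301, U+0323 and U+0327 come from the Unicode Character Database); `COMBINING_CLASS` and `canonical_reorder` are illustrative names.

```ruby
# Hypothetical, tiny combining class table; unlisted codepoints are class 0.
COMBINING_CLASS = {
  0x0301 => 230, # combining acute accent
  0x0323 => 220, # combining dot below
  0x0327 => 202  # combining cedilla
}.freeze

def canonical_reorder(codepoints)
  result = codepoints.dup
  pos = 0
  while pos < result.length - 1
    a = COMBINING_CLASS.fetch(result[pos], 0)
    b = COMBINING_CLASS.fetch(result[pos + 1], 0)
    if a > b && b > 0
      # Out of order: swap, then step back to re-check the previous pair.
      result[pos], result[pos + 1] = result[pos + 1], result[pos]
      pos += (pos > 0 ? -1 : 1)
    else
      pos += 1
    end
  end
  result
end

canonical_reorder([0x61, 0x0301, 0x0323]) # => [0x61, 0x0323, 0x0301]
```

Starters (class 0) never move, so marks are only sorted within the run that follows a base character.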
Replaces all ISO-8859-1 or CP1252 characters with their UTF-8 equivalents, resulting in a valid UTF-8 string.
Passing true as the second argument (force) will forcibly tidy all bytes, assuming that the string’s encoding is entirely CP1252 or ISO-8859-1.
# File lib/active_support/multibyte/unicode.rb, line 227
def tidy_bytes(string, force = false)
  if force
    return string.unpack("C*").map do |b|
      tidy_byte(b)
    end.flatten.compact.pack("C*").unpack("U*").pack("U*")
  end

  bytes = string.unpack("C*")
  conts_expected = 0
  last_lead = 0

  bytes.each_index do |i|

    byte = bytes[i]
    is_cont = byte > 127 && byte < 192
    is_lead = byte > 191 && byte < 245
    is_unused = byte > 240
    is_restricted = byte > 244

    # Impossible or highly unlikely byte? Clean it.
    if is_unused || is_restricted
      bytes[i] = tidy_byte(byte)
    elsif is_cont
      # Not expecting continuation byte? Clean up. Otherwise, now expect one less.
      conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
    else
      if conts_expected > 0
        # Expected continuation, but got ASCII or leading? Clean backwards up to
        # the leading byte.
        (1..(i - last_lead)).each {|j| bytes[i - j] = tidy_byte(bytes[i - j])}
        conts_expected = 0
      end
      if is_lead
        # Final byte is leading? Clean it.
        if i == bytes.length - 1
          bytes[i] = tidy_byte(bytes.last)
        else
          # Valid leading byte? Expect continuations determined by position of
          # first zero bit, with max of 3.
          conts_expected = byte < 224 ? 1 : byte < 240 ? 2 : 3
          last_lead = i
        end
      end
    end
  end
  bytes.empty? ? "" : bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
end
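On Ruby versions that provide String#scrub, a similar effect can be sketched with the standard library alone. This is an alternative technique, not the byte-walking algorithm listed above: each invalid byte sequence is reinterpreted as Windows-1252 and transcoded to UTF-8. The method name `tidy_with_scrub` is hypothetical.

```ruby
# Hypothetical helper: repairs stray CP1252 bytes in a UTF-8 string.
def tidy_with_scrub(string)
  string.scrub do |bad_bytes|
    # Treat the offending bytes as Windows-1252 and convert them;
    # bytes undefined in that code page become the replacement character.
    bad_bytes.encode(Encoding::UTF_8, Encoding::Windows_1252,
                     invalid: :replace, undef: :replace)
  end
end

tidy_with_scrub("caf\xE9 \xC3\xA9") # => "café é"
tidy_with_scrub("\x80")             # => "€"
```

Valid UTF-8 sequences such as `\xC3\xA9` are left alone; only the lone `\xE9` is reinterpreted.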
Unpack the string at codepoint boundaries. Raises an EncodingError when the encoding of the string isn’t valid UTF-8.
Example:
Unicode.u_unpack('Café') # => [67, 97, 102, 233]
# File lib/active_support/multibyte/unicode.rb, line 68
def u_unpack(string)
  begin
    string.unpack 'U*'
  rescue ArgumentError
    raise EncodingError, 'malformed UTF-8 character'
  end
end
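A caller will usually want to handle the malformed case explicitly. The standalone sketch below mirrors the same pattern with a hypothetical `codepoints_of` helper and Ruby's core EncodingError class, which may not be the same class this library raises.

```ruby
# Hypothetical helper: unpack UTF-8 codepoints, translating the
# ArgumentError raised by String#unpack into an EncodingError.
def codepoints_of(string)
  string.unpack('U*')
rescue ArgumentError
  raise EncodingError, 'malformed UTF-8 character'
end

codepoints_of("Caf\u00E9") # => [67, 97, 102, 233]

begin
  codepoints_of("Caf\xE9") # lone Latin-1 byte, not valid UTF-8
rescue EncodingError => e
  warn "skipping input: #{e.message}"
end
```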