Object
Various methods for cleaning up HTML and preparing it for safe public consumption.
Documents used for reference:
Allowed HTML elements.
Allowed attributes.
Allowed attributes that can contain URIs, so extra caution is required. NOTE: this does not list every URI attribute, only the ones that are allowed.
Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;
This method could be improved by adding a whitelist of HTML entities.
# File lib/html-cleaner.rb, line 152
def add_entities(str)
  str.to_s.gsub(/\"/, '&quot;').gsub(/>/, '&gt;').gsub(/</, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end
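The behaviour described above, and the suggested whitelist improvement, can be tried in isolation. The following is a minimal sketch, not part of html-cleaner: `add_entities` is copied from the listing, while `KNOWN_ENTITIES` and `add_entities_strict` are hypothetical names introduced here, and the entity list is deliberately abbreviated.

```ruby
# Standalone copy of add_entities from the listing, for experimentation.
def add_entities(str)
  str.to_s
     .gsub(/"/, '&quot;')
     .gsub(/>/, '&gt;')
     .gsub(/</, '&lt;')
     .gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end

# Hypothetical whitelist (an assumption, not in the library): only these
# named entities are left alone; any other "&name;" gets its "&" escaped.
KNOWN_ENTITIES = %w[amp lt gt quot apos nbsp copy].freeze

def add_entities_strict(str)
  pattern = /&(?!(\#\d+|\#x[0-9a-f]+|(?:#{KNOWN_ENTITIES.join('|')}));)/i
  str.to_s
     .gsub(/"/, '&quot;')
     .gsub(/>/, '&gt;')
     .gsub(/</, '&lt;')
     .gsub(pattern, '&amp;')
end

sample = 'AT&T <b> &#123; &bogus;'
puts add_entities(sample)        # => AT&amp;T &lt;b&gt; &#123; &bogus;
puts add_entities_strict(sample) # => AT&amp;T &lt;b&gt; &#123; &amp;bogus;
```

The only difference is the last alternative of the lookahead: the original accepts any 2 to 8 word characters followed by `;`, so an unknown name such as `&bogus;` passes through untouched, whereas the whitelist variant escapes it.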
Does this:
* Unescape HTML
* Parse HTML into tree
* Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree
* Each tag:
  * remove tag if not whitelisted
  * escape HTML tag contents
  * remove all attributes not on whitelist
  * extra-scrub URI attrs; see dodgy_uri?

Extra (i.e. unmatched) ending tags and comments are removed.
# File lib/html-cleaner.rb, line 60
def clean(str)
  str = unescapeHTML(str)

  doc = Hpricot(str, :fixup_tags => true)
  doc = subtree(doc, :body)

  # get all the tags in the document
  # Somewhere near hpricot 0.4.92 "*" starting to return all elements,
  # including text nodes instead of just tagged elements.
  tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

  # Remove tags that aren't whitelisted.
  remove_tags!(doc, tags - HTML_ELEMENTS)
  remaining_tags = tags & HTML_ELEMENTS

  # Remove attributes that aren't on the whitelist, or are suspicious URLs.
  (doc/remaining_tags.join(",")).each do |element|
    next if element.raw_attributes.nil? || element.raw_attributes.empty?
    element.raw_attributes.reject! do |attr,val|
      !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
    end

    element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
  end unless remaining_tags.empty?

  doc.traverse_text do |t|
    t.swap(add_entities(t.to_html))
  end

  # Return the tree, without comments. Ugly way of removing comments,
  # but can't see a way to do this in Hpricot yet.
  doc.to_s.gsub(/<\!--.*?-->/i, '')
end
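Two steps of clean can be illustrated without Hpricot: the attribute filter and the comment removal. The sketch below works on a plain Hash and a plain String. `ALLOWED_ATTRS`, `URI_ATTRS`, `suspicious?`, `filter_attributes` and `strip_comments` are hypothetical names, the whitelists are abbreviated placeholders for HTML_ATTRS and HTML_URI_ATTRS (whose contents are not listed here), and `suspicious?` is only a crude stand-in for dodgy_uri?.

```ruby
# Hypothetical, abbreviated whitelists (assumptions for this sketch).
ALLOWED_ATTRS = %w[href src title alt].freeze
URI_ATTRS     = %w[href src].freeze

# Crude stand-in for dodgy_uri?: only checks for a plain "javascript:" prefix.
def suspicious?(value)
  value.to_s.strip.downcase.start_with?('javascript:')
end

# Keep an attribute only if it is whitelisted and, for URI attributes,
# its value does not look suspicious. Same predicate shape as in clean.
def filter_attributes(attrs)
  attrs.reject do |name, value|
    !ALLOWED_ATTRS.include?(name) || (URI_ATTRS.include?(name) && suspicious?(value))
  end
end

# Strip comments, including ones that span several lines (note the /m flag).
def strip_comments(html)
  html.gsub(/<!--.*?-->/m, '')
end

attrs = { 'href' => 'javascript:alert(1)', 'title' => 'hi', 'onclick' => 'x()' }
p filter_attributes(attrs)                # => {"title"=>"hi"}
p strip_comments("a<!-- x\ny -->b")       # => "ab"
```

One design point worth noting: the pattern in clean carries only the `i` flag, and in Ruby `.` does not match a newline unless `m` is given, so a comment that spans lines would survive that `gsub`. The sketch adds `m` on the assumption that multi-line comments should be removed as well.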
Returns true if the given string contains a suspicious URL, i.e. a javascript link.
This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.
# File lib/html-cleaner.rb, line 117
def dodgy_uri?(uri)
  uri = uri.to_s

  # special case for poorly-formed entities (missing ';')
  # if these occur *anywhere* within the string, then throw it out.
  return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/i)

  # Try escaping as both HTML or URI encodings, and then trying
  # each scheme regexp on each
  [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|

      regexp = "#{scheme}:".gsub(/./) do |char|
        "([\000-\037\177\s]*)#{char}"
      end

      # regexp looks something like
      # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
      return true if (unesc_uri =~ %r{\A#{regexp}}i)
    end
  end

  nil
end
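To experiment with this check outside the library, a standalone sketch along the same lines might look as follows. It makes several assumptions: the scheme list is taken from the description above because the real DODGY_URI_SCHEMES constant is not shown, `CGI.unescapeHTML` stands in for the library's own unescapeHTML helper, `IGNORABLE` is a name introduced here, and the method returns false rather than nil for a clean URI.

```ruby
require 'cgi'

# Assumed scheme list, taken from the prose description of dodgy_uri?.
DODGY_URI_SCHEMES = %w[javascript vbscript livescript mocha data].freeze

# Characters tolerated between the letters of a scheme name:
# ASCII control characters, DEL and whitespace.
IGNORABLE = '[\x00-\x1f\x7f\s]*'

def dodgy_uri?(uri)
  uri = uri.to_s

  # A numeric entity with no terminating ';' anywhere in the string.
  return true if uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/i

  # Check both the HTML-unescaped and the URI-unescaped form.
  [CGI.unescapeHTML(uri), CGI.unescape(uri)].any? do |candidate|
    DODGY_URI_SCHEMES.any? do |scheme|
      # Allow ignorable characters before every character of "scheme:".
      pattern = "#{scheme}:".each_char.map { |c| IGNORABLE + Regexp.escape(c) }.join
      candidate =~ /\A#{pattern}/i
    end
  end
end

p dodgy_uri?('javascript:alert(1)')        # => true
p dodgy_uri?("jav\tascript:alert(1)")      # => true
p dodgy_uri?('java%0Ascript:alert(1)')     # => true
p dodgy_uri?('&#106avascript:alert(1)')    # => true
p dodgy_uri?('http://example.com/')        # => false
```

Because the pattern is anchored with `\A`, a scheme name that appears later in the string, for example inside a query parameter, does not trigger the check.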
For all other feed elements:
* Unescape HTML.
* Parse HTML into tree (taking ‘body’ as root, if present)
* Takes text out of each tag, and escapes HTML.
* Returns all text concatenated.
# File lib/html-cleaner.rb, line 99
def flatten(str)
  str.gsub!("\n", " ")
  str = unescapeHTML(str)

  doc = Hpricot(str, :xhtml_strict => true)
  doc = subtree(doc, :body)

  out = []
  doc.traverse_text {|t| out << add_entities(t.to_html)}

  return out.join
end
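The overall effect of flatten can be approximated with the standard library alone. The sketch below is not how the library does it: a regular expression replaces the Hpricot parse, so it is only suitable for illustrating the input and output shape, not for real sanitising. `flatten_approx` is a hypothetical name, `CGI.unescapeHTML` stands in for the library's unescapeHTML helper, and `add_entities` is copied from its listing.

```ruby
require 'cgi'

# Copy of add_entities from its listing.
def add_entities(str)
  str.to_s
     .gsub(/"/, '&quot;')
     .gsub(/>/, '&gt;')
     .gsub(/</, '&lt;')
     .gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;')
end

# Rough approximation of flatten: unescape, drop anything that looks like
# a tag, then re-escape the remaining text.
def flatten_approx(str)
  text = CGI.unescapeHTML(str.gsub("\n", ' ')) # gsub, not gsub!: input is not modified
  text = text.gsub(/<[^>]*>/, '')              # regexp stand-in for the tree walk
  add_entities(text)
end

puts flatten_approx("&lt;p&gt;Fish &amp; <b>chips</b>&lt;/p&gt;\n")
# => "Fish &amp; chips " (with the trailing space left by the newline)
```

Unlike the listing, which calls `gsub!` on its argument, this version leaves the caller's string untouched and therefore also accepts frozen strings.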