Issue 18609
sanitize data from publishers

18609
Reporter: hoefft
Assignee: cgendreau
Type: Improvement
Summary: sanitize data from publishers
Priority: Unassessed
Resolution: Fixed
Status: Closed
Created: 2016-06-24 11:39:44.265
Updated: 2017-10-10 16:05:18.695
Resolved: 2017-10-10 16:05:18.678
        
Description: For individual fields in the API response it would be nice to know what to expect. For example, a dataset description could currently look like this:

{code:xml}
Hello
\n\r

Headline



{code}

An odd mixture that allows XSS and is difficult to style.

I’m just one consumer of the API with a specific agenda.

I believe HTML is useful, so here are my preferences as of now.

From a purely visual perspective I would prefer to allow only italics and links: links because they are useful, italics because they appear in species names and aren’t visually invasive.

Bold is visually invasive, and so are headlines etc., which could conflict with our document layout. I would prefer not to have lists and tables either; they don’t seem to be used a lot, and species lists and tabular data seem to belong in other, more structured fields. Block elements should be stripped and followed by a break tag. Paragraphs are currently concatenated with a \n newline character; that too should become a <br> break tag.

That would also protect us against XSS which we are currently vulnerable to.]]></description>
    </Issue>
</pre>
<hr/>
<pre>

Author: hoefft
Created: 2016-06-24 11:49:38.569
Updated: 2016-09-22 10:32:11.306
        
I would imagine something along the lines of
{code}
Hello
{code}
becomes
{code}
Hello
{code}

{code}
Hello\nyou
{code}
becomes
{code}
Hello<br>you
{code}

{code}
<h1 style="color:red">headline</h1>
<p>paragraph with <a href="link-to-somewhere">link</a> and <em>italics</em></p>
<script>//Steal data</script>
<iframe src="evil.com">
{code}
becomes
{code}
headline<br>paragraph with <a href="link-to-somewhere">link</a> and <em>italics</em>
{code}</body>
    </Action>
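
The before/after pairs above can be read as an allow-list policy. The following is a minimal sketch of that policy, using only the Python standard library; it is not the sanitizer GBIF deployed. The tag lists, the accepted URL schemes and the names `sanitize` and `_Sanitizer` are assumptions made for illustration.

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed policy values; the thread does not fix exact lists.
ALLOWED_INLINE = {"a", "em", "i"}
DROP_WITH_CONTENT = {"script", "style", "iframe"}
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol",
              "li", "table", "tr", "blockquote", "pre", "br"}
SAFE_SCHEMES = {"", "http", "https", "mailto"}
BR = object()  # sentinel token for a pending line break


def _is_safe_href(href):
    # Drop whitespace/control characters before checking the scheme.
    cleaned = re.sub(r"[\x00-\x20]", "", href)
    try:
        return urlsplit(cleaned).scheme.lower() in SAFE_SCHEMES
    except ValueError:
        return False


class _Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []     # output strings and BR sentinels
        self.open_tags = []  # allowed tags emitted but not yet closed
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in BLOCK_TAGS:
            self.tokens.append(BR)
        elif tag == "a":
            href = dict(attrs).get("href")
            if href and _is_safe_href(href):
                self.tokens.append('<a href="%s">' % escape(href.strip(), quote=True))
                self.open_tags.append("a")
        elif tag in ALLOWED_INLINE:
            self.tokens.append("<%s>" % tag)  # all attributes are dropped
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in BLOCK_TAGS:
            self.tokens.append(BR)
        elif tag in self.open_tags:
            # Close any tags opened after this one so nesting stays valid.
            while True:
                closed = self.open_tags.pop()
                self.tokens.append("</%s>" % closed)
                if closed == tag:
                    break

    def handle_data(self, data):
        if self.skip_depth:
            return
        for i, part in enumerate(re.split(r"\r\n|\r|\n", data)):
            if i:
                self.tokens.append(BR)  # newline characters become breaks
            if part:
                self.tokens.append(escape(part, quote=False))


def sanitize(text):
    parser = _Sanitizer()
    parser.feed(text)
    parser.close()
    tokens = parser.tokens + ["</%s>" % t for t in reversed(parser.open_tags)]
    result = []
    for tok in tokens:
        at_line_start = not result or result[-1] is BR
        if tok is BR:
            if not at_line_start:  # collapse repeated and leading breaks
                result.append(BR)
        elif tok.strip() or not at_line_start:
            result.append(tok)
    while result and (result[-1] is BR or not result[-1].strip()):
        result.pop()  # drop trailing breaks and whitespace
    return "".join("<br>" if tok is BR else tok for tok in result)
```

Under these assumptions, `sanitize("Hello\nyou")` yields `Hello<br>you`. Unknown tags such as `<b>` are removed while their text is kept, and an unclosed `<iframe>` swallows everything after it, so the sketch fails closed.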
</pre><hr/>
<pre>

Author: kbraak@gbif.org
Created: 2016-06-24 13:58:48.597
Updated: 2016-06-24 13:58:48.597
        
Thank you Morten, this policy looks really nice.

[~cgendreau] our long term aim should be to properly sanitize all entities (e.g. Organisations, Installations, etc.) and persist them in sanitized form in the database.

Short term, we could try to just sanitize the Dataset description, which is the need that prompted this issue.

To do so, we could investigate an annotation on the Dataset.description field that's used by our custom sanitizer prior to every create or update of a Dataset. We use annotations in a similar manner to perform field-level validation.

I agree it's best just to silently sanitize the fields without rejecting the create/update request.
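
As a rough illustration of the annotation idea above, here is a minimal Python sketch. It uses dataclass field metadata as a stand-in for a Java field annotation; the metadata key `sanitize`, the simplified `Dataset` class and the helper `sanitize_marked_fields` are hypothetical names, not part of the GBIF registry API. Marked fields are silently rewritten, and nothing is rejected.

```python
from dataclasses import dataclass, field, fields, replace
from typing import Callable, Optional

SANITIZE = "sanitize"  # hypothetical metadata key, standing in for an annotation


@dataclass(frozen=True)
class Dataset:
    # Simplified stand-in for the real entity; only description is marked.
    key: str
    title: str
    description: Optional[str] = field(default=None, metadata={SANITIZE: True})


def sanitize_marked_fields(entity, sanitizer: Callable[[str], str]):
    """Return a copy of entity with every marked string field sanitized."""
    changes = {}
    for f in fields(entity):
        value = getattr(entity, f.name)
        if f.metadata.get(SANITIZE) and isinstance(value, str):
            changes[f.name] = sanitizer(value)
    return replace(entity, **changes)


def strip_bold(text: str) -> str:
    # Toy sanitizer used only to show which fields are touched.
    return text.replace("<b>", "").replace("</b>", "")


dataset = Dataset(key="1", title="<b>Title</b>", description="<b>Birds</b> of Denmark")
cleaned = sanitize_marked_fields(dataset, strip_bold)
# cleaned.description == "Birds of Denmark"; cleaned.title is left unchanged
```

A create or update handler would call `sanitize_marked_fields` just before persisting, so extending the behaviour to other entities would only require marking more fields.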

Anyways, if we went this route, it would be good to include this API change while working on POR-2562, which is related to bringing the API up to date to support the latest version of the GBIF metadata profile (v1.1).

Looking forward to your feedback, thanks.





</body>
    </Action>
</pre><hr/>
<pre>

Author: kbraak@gbif.org
Created: 2016-06-24 14:05:49.237
Updated: 2016-06-24 14:05:49.237
        
[~cgendreau] to help write our custom sanitizer we could leverage https://jsoup.org/ - a Java library for working with real-world HTML with a convenient API for extracting and manipulating data, and the ability to clean user-submitted content against a safe white-list, to prevent XSS attacks.

This library is currently in use by [gbif-metadata-profile|https://github.com/gbif/gbif-metadata-profile] to extract the machine-readable license title and URL inside the EML intellectualRights element.

Please let me know what you think about this library, or if you are aware of other better alternatives. Thanks.</body>
    </Action>
</pre><hr/>
<pre>

Author: cgendreau
Comment: I'm planning to use the OWASP library https://github.com/OWASP/java-html-sanitizer
Created: 2016-06-24 15:22:50.199
Updated: 2016-06-24 15:22:50.199
</pre><hr/>
<pre>

Author: hoefft
Comment: [~jlegind@gbif.org] what is your take on this? At some point I remember you mentioned the importance of lists in dataset descriptions. 
Created: 2016-07-01 07:53:01.075
Updated: 2016-07-01 07:53:01.075
</pre><hr/>
<pre>

Author: mdoering@gbif.org
Created: 2016-07-01 10:13:38.708
Updated: 2016-07-01 10:13:58.361
        
A very good issue and suggestion indeed!

In case you are not aware, our EML parsing generates the newline breaks, which we preferred over the <br> tag back then. EML allows a rather rich document language, including paragraphs and lists, but it differs from HTML, so we need to handle/translate it: http://www.hubbardbrook.org/eml/eml-2.0.0/docs/eml-2.0.0/eml-text.html#TextType

</body>
    </Action>
</pre><hr/>
<pre>

Author: jlegind@gbif.org
Created: 2016-07-04 11:12:23.35
Updated: 2016-07-04 11:12:23.35
        
[~hoefft] We see numbered lists or "bullet point like" lists that the publisher intends to have displayed that way.
This certainly goes for publisher descriptions as well. There, bullet point characters can be copy-pasted into the description string, and it would look much nicer if there was a mechanism in place for interpreting them.
(This is nice-to-have rather than need-to-have)</body>
    </Action>
</pre><hr/>
<pre>

Author: cgendreau
Created: 2016-07-11 11:36:56.415
Updated: 2016-07-11 11:37:02.69
        
Step 1:
https://github.com/gbif/gbif-common/commit/7217f48d44d1cc7373ee6790c85b76edaced42fa</body>
    </Action>
</pre><hr/>
<pre>

Author: hoefft
Created: 2016-11-28 10:48:17.189
Updated: 2016-11-28 10:48:17.189
        
I'm starting to think that a better solution is to strip only iframes, scripts and inline styling,
and then leave it to the consumer to remove headlines if they aren't desirable for the presentation at hand.

At least the suggestion I gave in the beginning included removing paragraph tags. I have since regretted that suggestion as break tags are a pain to style.

Further, we already sanitise the Drupal content, as that API is often loaded with empty p-tags, so using the same filters on Datasets etc. is trivial</body>
    </Action>
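
The narrower proposal above (strip only iframes, scripts and inline styling, and keep everything else) could look roughly like the following standard-library Python sketch. The function name `strip_unsafe` is hypothetical, and treating `<style>` elements the same way as `style` attributes is an assumption.

```python
from html import escape
from html.parser import HTMLParser

# Elements removed together with their content. Including "style" here is
# an assumption; the comment only names iframes, scripts and inline styling.
DROP_WITH_CONTENT = {"script", "iframe", "style"}
DROP_ATTRS = {"style"}  # inline styling


class _Stripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Re-emit the tag with every attribute except the dropped ones.
        kept = "".join(
            " %s" % name if value is None
            else ' %s="%s"' % (name, escape(value, quote=True))
            for name, value in attrs
            if name not in DROP_ATTRS
        )
        self.out.append("<%s%s>" % (tag, kept))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, without an end tag.
        if tag not in DROP_WITH_CONTENT:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_unsafe(text):
    parser = _Stripper()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

This sketch removes exactly the three things named and nothing more, so event-handler attributes such as `onclick` and `javascript:` links would still pass through. A production filter would have to cover those as well.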
</pre><hr/>
<pre>

Author: cgendreau
Created: 2016-11-28 14:24:36.642
Updated: 2016-11-28 14:24:36.642
        
From this commit:
https://github.com/gbif/registry/commit/42294157d10fc8de27ed002da031ed14f674da14

We allow paragraph tags and others</body>
    </Action>
</pre><hr/>
</body>
</html>