18609
Reporter: hoefft
Assignee: cgendreau
Type: Improvement
Summary: sanitize data from publishers
Priority: Unassessed
Resolution: Fixed
Status: Closed
Created: 2016-06-24 11:39:44.265
Updated: 2017-10-10 16:05:18.695
Resolved: 2017-10-10 16:05:18.678
Description: For individual fields in the API response it would be nice to know what to expect. Example:
A dataset description could currently look like this:
{code:xml}
Hello
\n\r
Headline
Author: hoefft
Created: 2016-06-24 11:49:38.569
Updated: 2016-09-22 10:32:11.306
I would imagine something along the lines of
{code}
Hello
{code}
becomes
{code}
Hello
{code}
{code}
Hello\nyou
{code}
becomes
{code}
Hello you
{code}
{code}
and italics
Author: kbraak@gbif.org
Created: 2016-06-24 13:58:48.597
Updated: 2016-06-24 13:58:48.597
Thank you Morten, this policy looks really nice.
[~cgendreau] our long-term aim should be to properly sanitize all entities (e.g. Organisations, Installations, etc.) and persist them in sanitized form in the database.
In the short term, we could try to sanitize just the Dataset description, which is the need that prompted this issue.
To do so, we could investigate an annotation on the Dataset.description field that's used by our custom sanitizer prior to every create or update of a Dataset. We use annotations in a similar manner to perform field-level validation.
I agree it's best just to silently sanitize the fields without rejecting the create/update request.
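As a rough illustration of the annotation idea above, here is a minimal sketch using only the Java standard library. The `@Sanitized` annotation, the `FieldSanitizer` class and the simplified `Dataset` class are all hypothetical names made up for this example, not existing registry code, and the whitespace-collapsing `clean` method is only a stand-in for whatever real sanitizer is chosen.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical marker annotation (assumption, not part of the registry API).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Sanitized {}

// Simplified stand-in for the real Dataset entity.
class Dataset {
    String title;
    @Sanitized String description;
}

final class FieldSanitizer {
    // Stand-in sanitizer: collapses each run of whitespace into one space and trims.
    static String clean(String value) {
        return value == null ? null : value.replaceAll("\\s+", " ").trim();
    }

    // Rewrites every String field marked @Sanitized in place.
    // Nothing is rejected; the entity is silently cleaned and returned.
    static <T> T sanitize(T entity) {
        for (Field f : entity.getClass().getDeclaredFields()) {
            if (f.isAnnotationPresent(Sanitized.class) && f.getType() == String.class) {
                f.setAccessible(true);
                try {
                    f.set(entity, clean((String) f.get(entity)));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return entity;
    }
}
```

The create and update paths would call `FieldSanitizer.sanitize(dataset)` before persisting, so unannotated fields such as `title` stay untouched.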
Anyway, if we went this route, it would be good to include this API change while working on POR-2562, which is related to bringing the API up to date to support the latest version of the GBIF metadata profile (v1.1).
Looking forward to your feedback, thanks.
Author: kbraak@gbif.org
Created: 2016-06-24 14:05:49.237
Updated: 2016-06-24 14:05:49.237
[~cgendreau] to help write our custom sanitizer we could leverage https://jsoup.org/ - a Java library for working with real-world HTML with a convenient API for extracting and manipulating data, and the ability to clean user-submitted content against a safe whitelist, to prevent XSS attacks.
This library is currently in use by [gbif-metadata-profile|https://github.com/gbif/gbif-metadata-profile] to extract the machine-readable license title and URL inside the EML intellectualRights element.
Please let me know what you think about this library, or if you are aware of better alternatives. Thanks.
Author: cgendreau
Comment: I'm planning to use OWASP library https://github.com/OWASP/java-html-sanitizer
Created: 2016-06-24 15:22:50.199
Updated: 2016-06-24 15:22:50.199
Author: hoefft
Comment: [~jlegind@gbif.org] what is your take on this? At some point I remember you mentioned the importance of lists in dataset descriptions.
Created: 2016-07-01 07:53:01.075
Updated: 2016-07-01 07:53:01.075
Author: mdoering@gbif.org
Created: 2016-07-01 10:13:38.708
Updated: 2016-07-01 10:13:58.361
A very good issue and suggestion indeed!
In case you are not aware, our EML parsing generates the new line breaks, which we preferred at the time over the tag. EML allows a rather rich document language, including paragraphs and lists, but it differs from HTML, so we need to handle/translate it: http://www.hubbardbrook.org/eml/eml-2.0.0/docs/eml-2.0.0/eml-text.html#TextType
Author: jlegind@gbif.org
Created: 2016-07-04 11:12:23.35
Updated: 2016-07-04 11:12:23.35
[~hoefft] We see numbered lists or "bullet point like" lists included that the publisher intends to have displayed in that way.
Certainly this goes for publisher descriptions as well. There, bullet point characters can be copy-pasted into the description string, and it would look much nicer if there were a mechanism in place for interpreting this.
(This is nice-to-have rather than need-to-have)
Author: hoefft
Created: 2016-11-28 10:48:17.189
Updated: 2016-11-28 10:48:17.189
I'm starting to think that a better solution is to strip only iframes, scripts and inline styling,
and then leave it to the consumer to remove headlines if they aren't desirable for the presentation at hand.
The suggestion I gave in the beginning included removing paragraph tags. I have since regretted that suggestion, as break tags are a pain to style.
Furthermore, we already sanitise the Drupal content, as that API is often loaded with empty p-tags, so using the same filters on Datasets etc. is trivial.
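To make the reduced scope concrete, here is a rough sketch of such a filter: drop script and iframe elements, drop inline `style` attributes, drop empty p-tags, and leave everything else (headlines, paragraphs, lists) to the consumer. The class name `MinimalStripper` and the sample input are invented for illustration. It uses plain regular expressions only to stay self-contained; regex-based HTML filtering is easy to bypass, so a real implementation should rely on a proper sanitizer library such as the ones mentioned earlier in this issue.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; NOT a safe XSS filter.
final class MinimalStripper {
    // <script>...</script> and <iframe>...</iframe>, including their content.
    private static final Pattern BLOCKS = Pattern.compile(
        "<(script|iframe)\\b[^>]*>.*?</\\1\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Quoted inline style attributes, e.g. style="color:red".
    // Assumption: the pattern is not tag-aware, so it would also match
    // the same character sequence inside plain text.
    private static final Pattern STYLE_ATTR = Pattern.compile(
        "\\s+style\\s*=\\s*(\"[^\"]*\"|'[^']*')",
        Pattern.CASE_INSENSITIVE);

    // Paragraph elements that contain nothing but whitespace.
    private static final Pattern EMPTY_P = Pattern.compile(
        "<p\\b[^>]*>\\s*</p\\s*>",
        Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        String out = BLOCKS.matcher(html).replaceAll("");
        out = STYLE_ATTR.matcher(out).replaceAll("");
        out = EMPTY_P.matcher(out).replaceAll("");
        return out;
    }
}
```

With this sketch, `<h1 style="color:red">Headline</h1><script>alert(1)</script><p></p><p>Hello</p>` comes out as `<h1>Headline</h1><p>Hello</p>`: the headline and paragraph survive, and only the script, the inline style and the empty paragraph are removed.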
Author: cgendreau
Created: 2016-11-28 14:24:36.642
Updated: 2016-11-28 14:24:36.642
From this commit:
https://github.com/gbif/registry/commit/42294157d10fc8de27ed002da031ed14f674da14
We allow paragraph tags and others.