How to sanitize user input for markdown?

We have a database field of document titles. Basically any unformatted, valid Unicode text can be a title. I’d like to present these in the app as clickable links, presumably using Markdown. How should I escape / sanitize them? I’m aware that this is not strictly a Streamlit question, but it seems to be an important part of Streamlit best practices, and my first few Google queries did not help.


Hi @danielvarga,

There are two parts to your task: handling the Unicode text and encoding the URL. Let me start with the second part and work backwards to making sure special Unicode characters don’t break anything.

Using only what’s included with Python, there’s urllib to encode strings for use in URLs. See this page for a well-written discussion of urllib in both Python 2 and Python 3, each of which organizes this functionality differently.

Note also the quote_plus() function, which converts spaces to “+”; this may or may not be something you need.
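
For a quick illustration of the difference (just an example string, not from your data):

from urllib.parse import quote, quote_plus

# quote() percent-encodes spaces, while quote_plus() turns them into "+"
print(quote("white horse"))       # white%20horse
print(quote_plus("white horse"))  # white+horse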

In terms of good Unicode handling, Python 3 handles Unicode natively, while Python 2 needs some encouragement. :slight_smile:

So in Python 3 your script may look something like this:

from urllib.parse import quote
import streamlit as st

title = "άλογο"
# quote() percent-encodes the title so it is safe to embed in a URL
url = "https://en.wiktionary.org/wiki/" + quote(title)
st.markdown(f"[{title}]({url})")

While in Python 2 you may have something like this:

# encoding: utf-8
from urllib import quote
import streamlit as st

title = "άλογο"
# in Python 2, quote() works on the UTF-8 byte string directly
url = "https://en.wiktionary.org/wiki/" + quote(title)
st.markdown("[{title}]({url})".format(title=title, url=url))

I hope this answers your question! Unicode can be a minefield, especially in Python 2. Let us know if we can be of more help.

Dear @nthmost,

I really appreciate the detailed heads-up, but my question was more about escaping than encoding. I’d like to switch off all Markdown rendering for user data, because the user data is not valid Markdown. Underscores and asterisks are a bigger concern here than non-ASCII characters. I’ve changed the title text in your Python 3 example as a test:

from urllib.parse import quote
import streamlit as st

# a title crafted to break out of the link and inject its own Markdown
title = "]() Closed the markdown link. **Danger here:** [_markdown injection_](http://dangerous.com). [Opened it again"
url = "https://en.wiktionary.org/wiki/" + quote(title)
st.markdown(f"[{title}]({url})")

Oh, I see! Thanks for clarifying.

At the moment there isn’t a good way to display clickable links without pushing text through the Markdown processor, so I see your problem.

I think your best solution then is to do some data cleaning on each title to escape the Markdown special characters. Something like:

cleaned_titles = []
MD_SPECIAL_CHARS = "\\`*_{}[]#+-."
for title in titles:
    for char in MD_SPECIAL_CHARS:
        if char in title:
            title = title.replace(char, "\\" + char)
    cleaned_titles.append(title)

(If you need to make this loop more efficient, you might use re.sub from the regular expression library instead.)
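
For instance, a single-pass version might look roughly like this (same character set as above, just a sketch):

import re

MD_SPECIAL_CHARS = "\\`*_{}[]#+-."
# Match any single Markdown special character and prefix it with a backslash
md_escape = re.compile("([" + re.escape(MD_SPECIAL_CHARS) + "])")
cleaned_titles = [md_escape.sub(r"\\\1", title) for title in titles]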

Does this seem like it addresses the problem? Let us know!

(edit: fixed typo in my code snippet)


Thank you! Yes, this addresses my problem head-on! I don’t know enough Markdown to be sure that there are no edge cases this solution does not cover, but it looks good, and at the very least it deals with all the issues I’ve encountered so far. I won’t bother with regexes, since there won’t be too many titles rendered on a single page. For future reference, here is how it looks in my code currently, after fixing some typos and adding three more special characters:

def escape_markdown(text):
    # Prefix every Markdown special character with a backslash
    MD_SPECIAL_CHARS = "\\`*_{}[]()#+-.!"
    for char in MD_SPECIAL_CHARS:
        text = text.replace(char, "\\" + char)
    return text
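
For completeness, this is roughly how it plugs into the link rendering (the title here is just a made-up example):

from urllib.parse import quote
import streamlit as st

title = "a title with *asterisks* and _underscores_"
url = "https://en.wiktionary.org/wiki/" + quote(title)
# quote() keeps the URL valid; escape_markdown() keeps the link text literal
st.markdown(f"[{escape_markdown(title)}]({url})")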

Thanks a lot again!


Hey Daniel,

I just looked at my reply and noticed a typo in the code I wrote for you, so I’m glad you figured it out. :blush:

I think this code will probably get you about 95% of the way to total coverage, if not 99-100%. There are almost always edge cases when it comes to doing data cleaning on the fly. Let us know what you dig up – as you pointed out from the beginning, this thread will be useful to others.
:beers:
