Google not indexing Streamlit site because it sets the "canonical" to https://awesome-streamlit.org/

Hi,

I registered my website on Google Search Console.

Google crawled my site but does not add it to the index, because it considers it a copy of https://awesome-streamlit.org/, which it sets as the canonical version.

I guess I should set a “user defined canonical” but I have no idea how to do it. My site runs on nginx.

I have added a sitemap, but that does not help.

I am not the only one to have this issue: https://support.google.com/webmasters/thread/45113463?hl=en

Many thanks

Fabio

Hi Fabio,

It is something you should be able to override in your CMS settings.

Have you got access to it?

Thanks,
Charly

Hi Charly

I understand it is a part of nginx. In that case yes. I have already changed the config file to let google access the site map file. Is it a similar process?

Best

Fabio

Hi Charly,

I googled around but could not find anything to fix my problem. They talk of internal canonical pages (many pages with same content in a given website).

The issue is that google thinks that my site is a copy of the awesome streamlit site.

It is a Google algorithm that guards against copycat sites. Since Google thinks my site is a copy, it does not add my ‘copycat’ site to the index. I do not think I can simply add a parameter somewhere saying “I am not a copycat”, because in that case all the copycat sites would do it and that would defeat Google's purpose.

The other website that had the same problem (link above) did not solve it. What they did was make a separate standard site that google crawls and put a link on that site to the streamlit app. Not nice.

I wonder whether this is a common issue, or whether it happens only if the Streamlit app has some particular characteristics that make Google think it is equivalent to the awesome-streamlit site.

My markdown content has nothing in common and the code has nothing in common.

So, by exclusion, is it that my script is relatively long (6000 lines)? So if you want to be indexed, keep your code under some number of lines?

Thanks

Fabio

To help Fabio, @randyzwitch can canonicalisation rules be amended within a Streamlit app?

I doubt it, but thought I would ask :slight_smile:

I started an internal discussion about this topic, so we’ll see what the thought is when everyone gets back to work this week

Thanks Randy!

@Fabio: you should be able to override canonicals if you’re on CDN services like Cloudflare.

What’s your CMS?

Charly

Hi Charly,

I have a dedicated server. I do not use any CDN service like Cloudflare (except if Hetzner, the hosting company I get my server from, uses such services).

I use nginx. I understand that nginx has a CMS function. I can share my nginx configuration if it helps.

I am new to this so I am not sure what you mean by CMS, but other than nginx I have nothing installed on the server.

Does this help?

Many thanks

Fabio

Looks like there are some links out there for this:

Hi Randy,

I will try.

However it is not the same use case.

My problem (and these people's unsolved problem as well: https://support.google.com/webmasters/thread/45113463?hl=en ) is that Google thinks my website is a copy of an external website. The Medium post is about three copies of the same content within a given website.

Google crawls my website and says:

“Oh, this looks really like a copy of awesome-streamlit.org. So I will not put it in my index because I do not want Randy to end up on a copycat site.” Basically, what Streamlit “exposes” to the Google spider is the same for my site and for awesome-streamlit, even though the content is completely different.

But since Google has invested in Streamlit, I have hope :smile:

Cheers and many thanks for your help

Fabio

Right, but those posts show how to declare the canonical link. From your first post, it appears that because there is no user-declared canonical link (i.e. None), then Google selects it to be awesome-streamlit. If you set the canonical value using nginx, I would hope Google would respect that.

Good point. I will try and let you know.

Many thanks

Fabio

Hi Randy

I inserted this line

location / {
    add_header Link "<$scheme://$http_host/testspacevariance.c>; rel=\"canonical\"";
}

to my location nginx configuration (I hope it can help others) and I saw some significant improvement.

It is still not certain that it is over ("URL will be indexed only if certain conditions are met"), but there is hope.
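
For anyone who wants to confirm that the header is really being served, fetching the page and reading its Link response header is one option. Below is a minimal Python sketch, not a definitive check: the function names canonical_from_link_header and fetch_canonical are made up for illustration, example.com is a placeholder, and the parsing is simplified (it assumes no commas inside quoted parameter values).

```python
import re
import urllib.request


def canonical_from_link_header(value):
    """Return the URL tagged rel="canonical" in a Link header value, or None."""
    # Each entry is expected to look like: <https://example.com/page>; rel="canonical"
    # Simplification (assumption): entries are separated by commas that do not
    # appear inside quoted parameter values.
    for target, params in re.findall(r'<([^>]*)>([^,]*)', value or ""):
        if re.search(r'\brel\s*=\s*"?[^";]*\bcanonical\b', params, re.IGNORECASE):
            return target
    return None


def fetch_canonical(url):
    """Fetch url and return the canonical URL declared in its Link header, if any."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return canonical_from_link_header(response.headers.get("Link"))


if __name__ == "__main__":
    # Offline example with a placeholder header value.
    sample = '<https://example.com/app>; rel="canonical"'
    print(canonical_from_link_header(sample))
```

Calling fetch_canonical with the site's own URL should return the address set in the nginx add_header line; a result of None would suggest the header is not reaching the client.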

Thanks!

Fabio

Randy, Charly,

Google, in its infinite wisdom, has decided that my website is a clone of awesome-streamlit.

It indexed the site today and said no.

Kind regards

Fabio

Wow, that’s unfortunate.

If it is just my site it is OK.

@randyzwitch @andfanilo Self-referential canonicals are must-haves for Streamlit apps, allowing them to be indexed/discoverable organically in Google.

Thanks,
Charly

Charly,

I have a question. Is this something that happens only to me? I was trying to understand what Google is looking at to decide my page is a copycat. If it happens only to me, with 250000 Streamlit applications out there, it should be fairly easy to understand what pisses Google off…

Otherwise, either most of these 250000 are not interested in Google indexing, or they get indexed by putting the content on some other site from which they link to the Streamlit app.

Many thanks

Fabio

Cool!