Because of Lemmy's default robots.txt and meta tags, search engines index even non-local communities. This leads to undesirable results, such as unrelated content being associated with your instance.

As of today, lemmy-ui does not allow hiding non-local (or any) communities from Google and other search engines. If you, like me, do not want your instance to be associated with other content, you can add a custom robots.txt and response headers to avoid indexing.

In nginx, simply add this:

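# Disallow all search engines
location / {
    ...
    # Ask crawlers not to index anything served from this location
    add_header X-Robots-Tag noindex;
}

location = /robots.txt {
    # Set the type with default_type; add_header would append an extra Content-Type header
    default_type text/plain;
    return 200 "User-agent: *\nDisallow: /\n";
}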

Here’s a commit in my fork of the lemmy-ansible playbook. And here’s a corresponding issue I opened in lemmy-ui.

I hope this helps someone :-)

  • parmesancrabs

    Would it be a better idea to exclude only URLs that match /c/*@*.*? I think that would block external communities but keep local ones indexable at their native locations.
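
    A minimal nginx sketch of that idea, assuming remote communities are served under /c/name@host and that the crawler honors * wildcards in robots.txt (Google and Bing do, but not every crawler is guaranteed to). The $robots_tag variable name is made up for illustration:

    ```nginx
    # http {} context: flag only URLs that look like /c/<name>@<host>
    map $request_uri $robots_tag {
        default          "";
        ~^/c/[^/@]+@     "noindex";
    }

    server {
        ...
        location / {
            ...
            # nginx omits a header whose value is empty, so local pages stay indexable
            add_header X-Robots-Tag $robots_tag;
        }

        location = /robots.txt {
            default_type text/plain;
            return 200 "User-agent: *\nDisallow: /c/*@*\n";
        }
    }
    ```

    A crawler that obeys the Disallow rule never fetches those pages, so it will not see the X-Robots-Tag header. The two rules are better treated as alternatives than as a pair.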

    • parmesancrabs

      Or maybe the lemmy source code should include a canonical tag to the original host’s post?

      • binwiederhier@discuss.ntfy.sh (OP)

        I think lemmy-ui should add a way to exclude non-local communities from indexing and/or add a canonical tag like you suggested. I like that idea.