


We should delete all old imported content.

+8
−4

Early in the life of the Codidact sites, it seemed like importing content from SE might be a good way to get a new site going quickly. Now, with the clarity of hindsight, we can see that this didn't work. The few sites that did mass-import content are doing very poorly. As far as I can tell, these are Writing, Outdoors, and Scientific Speculation. These are three of the four least-active sites we have. There is a strong correlation between sites that imported content and sites that are doing poorly.

OK, so importing content doesn't work, and we're not doing that anymore. However, the imported content is still hurting us. Even today, it has negative value.

It seems that search engines, particularly Google, are penalizing us for having lots of duplicate content.

I did some tests, searching for the titles of posts by copying them verbatim into the Google and Bing search bars. I tried to pick questions with reasonably generic titles so that there would be lots of content out there on the web for the search engines to match against. I considered a site as "not listed" by a search engine if there was no reference to it in the first two pages of search results. Here is what I found:

Search for title of imported post

"Is it safe to carry propane gas cylinder in minivan?" from Outdoors

    Google: Stack Exchange #1, Codidact not listed.

    Bing: Stack Exchange #1, Codidact not listed.

Search for title of home-grown post from site with imported content

"In the United States, where would I find how a geological feature got its name?" from Outdoors

    Google: Codidact not listed.

    Bing: Codidact not listed.

Search for title of home-grown post from a non-import site

"Is ESD Overhyped" from Electrical Engineering

    Google: Codidact #3.

    Bing: Codidact #1.
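
The "not listed" rule used in these tests can be written down as a small sketch. This is only an illustration, not the procedure actually used (the searches above were done by hand). The function names, the sample URLs, and the figure of 10 results per page are all assumptions; only the two-page cutoff comes from the description above.

```python
from typing import List, Optional
from urllib.parse import urlparse

RESULTS_PER_PAGE = 10  # assumption: typical page size, not stated in the post
PAGES_CHECKED = 2      # from the post: only the first two pages count


def rank_of(domain: str, result_urls: List[str]) -> Optional[int]:
    """Return the 1-based position of the first result hosted on `domain`
    (or one of its subdomains), or None if there is no such result."""
    for position, url in enumerate(result_urls, start=1):
        host = urlparse(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            return position
    return None


def describe(domain: str, result_urls: List[str], label: str) -> str:
    """Format a result the way the lists above do: 'Codidact #3' or
    'Codidact not listed'. Results past the cutoff are ignored."""
    cutoff = RESULTS_PER_PAGE * PAGES_CHECKED
    rank = rank_of(domain, result_urls[:cutoff])
    if rank is None:
        return f"{label} not listed"
    return f"{label} #{rank}"


# Hypothetical result list, typed in by hand after a manual search.
results = [
    "https://outdoors.stackexchange.com/q/1",
    "https://example.org/some-article",
    "https://electrical.codidact.com/posts/1",
]
print(describe("stackexchange.com", results, "Stack Exchange"))
print(describe("codidact.com", results, "Codidact"))
```

Writing the rule down this way makes the cutoff explicit, so anyone repeating the tests applies the same definition of "not listed".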

Conclusion

The imported content isn't being presented by search engines, so it does us no good. But even worse, it seems to cause search engines to "blacklist" us so that non-imported content isn't shown either.

Some of this damaging effect seems to be spilling over between Codidact sites, particularly on Google. In other words, the whole codidact.com domain is at least somewhat affected because some sites have duplicate content. Note that in the last test case, Google showed two looser matches ahead of the exact match on Codidact.

I therefore propose that we mass-delete all imported content. We may lose a few stray locally-grown answers to old questions, but those are very minor in the scheme of things. Those few answers effectively don't exist according to the search engines anyway. Meanwhile, old imported content is hurting all Codidact sites, not just the ones with the imported content. This is therefore a Codidact-wide issue, and should be dealt with as such. It's time to get past the sunk cost fallacy and cut our losses.


Data about limited imports

Mithrandir pointed out in a comment:

selectively importing posts seems to have worked quite well on the one site [Judaism] that did so

I ran tests similar to those above on the Judaism site, copying the titles of questions directly into the Google and Bing search bars. I tried to pick generic-looking questions so that there would be other stuff out there for the search engines to find, but I may have gotten this wrong because I don't know that much about Judaism. Anyway, here are the results:

Search for title of imported post

Why is Tzaara'as considered a Sakana?

    Google: Stack Exchange #1, Codidact #5.

    Bing: Stack Exchange #1, Codidact #2.

Experience-based advice for focusing and slowing down prayers?

    Google: Stack Exchange #1, Codidact #2.

    Bing: Stack Exchange #1, Codidact #14.

Search for title of home-grown post

Are flowers muktzah on Shabbat?

    Google: Codidact not listed.

    Bing: Codidact #21.

What are the flaws in the ten kal vachomer arguments in the torah?

    Google: Codidact #1.

    Bing: Codidact #2.

Updated Conclusions

  1. A small number of imports doesn't seem to hurt a site as much as mass-imports.

  2. The search engine ranking still isn't "great" for home-grown posts. It is hard to say whether that is due to the limited duplicates on the particular Codidact site, or the many duplicates in the whole codidact.com domain.

  3. There is a clear case for deleting all mass-imported content that hasn't been touched. These posts are absolutely hurting the sites they are in, and they are also hurting everything in the codidact.com domain to some extent. All these posts must be deleted Codidact-wide. It's not just up to the individual sites because everyone is getting hurt.

  4. There is no clear case at this time for deleting the small number of selectively-imported posts, or those that have been modified here. This should be re-evaluated once the mass-imported posts have been deleted and some time (a month, maybe?) has passed to let the search engines adjust to the new conditions. Of course if individual sites wish to delete all their imported content, then this should be supported.

Are you testing in an incognito browser?

No, I didn't think of that. I'm not sure that matters, though. I thought those modes don't so much hide who you are as delete all temporarily stored data (like cookies) when you end the session. I don't see how deleting cookies after the search should matter.

Maybe someone who actually understands this stuff (unlike me) can weigh in here.


1 comment thread

General comments (7 comments)
Mithrandir24601 wrote almost 4 years ago

As far as I'm personally concerned, I agree that mass-imports were a bad idea and didn't help. I would like to caveat my agreement: mass-imports are the problem, and selectively importing posts seems to have worked quite well on the one site that did so. I'm not really involved in Outdoors at the minute, but I personally would like to see this happen on both Writing and Scientific Speculation.

Olin Lathrop wrote almost 4 years ago

@Mith: What sites did selective imports?

Mithrandir24601 wrote almost 4 years ago

@OlinLathrop I know Judaism did, although it was maybe only ~20 questions

Olin Lathrop wrote almost 4 years ago

@Mith: It would be interesting to try the first two experiments on the Judaism site to see how much limited imports affect things. Someone who knows the topic should do that so that they can pick generic questions with lots of stuff already out there on the web.

Olin Lathrop wrote almost 4 years ago

@Mith: See update to the question.

Monica Cellio wrote almost 4 years ago

Are you testing in an incognito browser? I just searched Google in such for "problem collaborative work different voices"; the first hit was a scraper, the second SE, and the third Codidact. (Interestingly, the scraper scraped the question but not the answers, so it's actually less useful than the others, but they must know how to do SEO.) In DuckDuckGo, neither appears on the first page of results. I'd like to figure out a good testing methodology so we know what we're measuring.

Olin Lathrop wrote almost 4 years ago

@Monica: See addition to the question.