Language learning and linguistics nerds unite

@afellowkid@lemmygrad.ml · 2 years ago

@holdengreen@lemmygrad.ml · 2 years ago

language acquisition

Yes, I had been working on Python scripts to help me do this (on GitHub). It would be useful if knowledgeable people helped to make specialized corpora and datasets specifically for learning languages.

Like I was trying to create a tool that would create language lessons from an arbitrary ebook.

@afellowkid@lemmygrad.ml · 2 years ago

Sounds interesting. Do you mean it would take any prose/text and then comb its vocab and grammar forms and put what it finds into a lesson format? I feel like I might be misunderstanding this.

@holdengreen@lemmygrad.ml · 2 years ago

here is my last work on it actually: https://github.com/holdengreen/lingtool

I was looking at using Mozilla’s TTS and its Chinese fork for TTS. I was really hoping to get an Android app out in short order… Generating usable audio files in a container is the first priority.

@afellowkid@lemmygrad.ml · 2 years ago

I see! I really have no programming knowledge at all but I like the idea of this

@holdengreen@lemmygrad.ml · 2 years ago

If you know another language well enough, like Spanish or Chinese or any other major language, that is useful.

It’s hard to make the code good enough to accurately break down individual sentences on its own.

Arsen6331 ☭ · 2 years ago

It’s hard to make the code good enough to accurately break down individual sentences on its own.

Yeah, that would be the problem with such a solution. It’s hard to analyze human language using a machine.

@holdengreen@lemmygrad.ml · edit-2 2 years ago

I was using Google/Baidu and Argos ML models to translate the text in large sections, and they can do it fairly accurately because they’re trained on large corpora of natural language and can make use of the surrounding context.

Trouble is for my application I need to use libraries like THULAC to separate the individual parts of speech in a sentence and then hopefully translate the nouns and verbs that need to be translated for the user automatically. But that is something the existing models are just not built for. (or maybe they can be retrofitted/adapted?) So I may need new models and combine that with some hand coded logic to get the best automatic result. Or the app/tool will just need to rely on more manual methods meaning you might need to do it yourself upfront.
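To make the idea concrete, here is a rough sketch of the "pick out the nouns and verbs" step. It assumes the segmenter hands back (word, tag) pairs with THULAC-style tags where n marks a noun and v a verb; I believe the thulac Python package's cut() returns pairs like that, but treat it as an assumption. The names content_words and gloss and the glossary dict are made up for illustration, and the glossary stands in for whatever translation step ends up being used.

```python
def content_words(tagged, wanted=("n", "v")):
    """Keep the first occurrence of each word whose tag is in `wanted`.

    `tagged` is an iterable of (word, tag) pairs. The tag names are an
    assumption (THULAC-style: "n" = noun, "v" = verb).
    """
    seen = set()
    kept = []
    for word, tag in tagged:
        if tag in wanted and word not in seen:
            seen.add(word)
            kept.append((word, tag))
    return kept


def gloss(tagged, glossary):
    """Attach a translation to each noun/verb, or None if it is missing.

    `glossary` is a hypothetical word -> translation dict; in a real tool
    it would be filled by a dictionary lookup or a translation model.
    """
    return [(word, tag, glossary.get(word)) for word, tag in content_words(tagged)]


# Hand-tagged sample so the sketch runs without any segmenter installed.
sample = [("我", "r"), ("喜欢", "v"), ("读", "v"), ("书", "n"), ("。", "w")]
for word, tag, meaning in gloss(sample, {"喜欢": "to like", "书": "book"}):
    print(word, tag, meaning)
```

The words that come back with None are the ones that would still need a model or a human to fill in, which is the hard part described above.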

@holdengreen@lemmygrad.ml · 2 years ago

Yes, that and parallel translation, so it can create an audiobook that helps you understand the vocabulary you don’t know.

@redtea@lemmygrad.ml · 2 years ago

Do you or other techy comrades know how to do the following…

I’d like to read some García Márquez in the original, but it’s a bit too advanced for me. I was thinking of front-loading the vocab and learning the words that are in the book(s) before starting. The only way I can think of doing this is intensively reading it/them and listing all the words I don’t know. The problem is that I’ll end up understanding enough of the story to spoil it, but not understand enough to enjoy it.

So what I’d like to do is list all the words used in the text by frequency. Then I can ignore the words I do know and learn a bulk of the less frequent words before starting to read the book. Is there a script or something that I could apply to the epub version to strip the words and sort them by frequency?

@holdengreen@lemmygrad.ml · 2 years ago

Yeah that’s a class of application of the app/tool that I had in mind. I can write that in Python or find it somewhere if you give me a hot min.
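As a starting point, a minimal sketch using only the Python standard library might look like the following. It relies on the fact that an EPUB is a zip archive of (X)HTML files; the function names are made up for illustration, and it assumes a DRM-free book whose chapters are UTF-8 encoded. It counts surface forms only, so "tiene" and "tener" are counted separately.

```python
import re
import zipfile
from collections import Counter
from html.parser import HTMLParser

# Tags after which a space is inserted so neighbouring paragraphs do not fuse.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects text nodes of an (X)HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def epub_text(path):
    """Read every (X)HTML file inside the EPUB archive and return its text."""
    parts = []
    with zipfile.ZipFile(path) as book:
        for name in book.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                raw = book.read(name).decode("utf-8", errors="replace")
                parts.append(html_to_text(raw))
    return "\n".join(parts)


WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, accented ones included


def word_frequencies(text):
    """Count lowercased words in `text`."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


# Usage (the file name is a placeholder):
# for word, count in word_frequencies(epub_text("book.epub")).most_common():
#     print(count, word)
```

most_common() lists the most frequent words first; reverse the list to start from the rare ones instead.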

@holdengreen@lemmygrad.ml · 2 years ago

looks like people have lots of ways to do this if you look around. https://cybertext.wordpress.com/2021/06/30/use-calibre-to-get-a-word-frequency-list/ https://www.reddit.com/r/languagelearning/comments/mps9nm/is_there_a_program_that_generates_word_frequency/

@redtea@lemmygrad.ml · 2 years ago

Thank you!

This is exactly what I was looking for.

I’ll try using calibre.

@holdengreen@lemmygrad.ml · 2 years ago

OK… Unix-like environments are made to do this type of stuff really well. For example: https://ebooks.stackexchange.com/questions/5841/i-am-looking-for-a-software-or-a-way-to-list-extract-count-in-short-analyze

If you would like I could try to package something more convenient in a container or app or whatever. (which was the point of the project).

@redtea@lemmygrad.ml · 2 years ago

No need to do this just for me, as I’ll try using calibre first.

Just a thought, though…

There might be quite a bit of interest in an app that could (maybe this already exists!):

1. list all words in a text;
2. arrange those words by frequency (increasing or decreasing);
3. automate a translation using e.g. deepl or another translator; and
4. export those words into Anki cards (with an option to ignore words taken from other lists of the most frequent 100, 200, 500, 1000, 3000, 5000 words, etc*).

IME trying to rote learn the most frequent 1-200 words is a bit pointless as these words are so frequent they seem to have the most meanings and that meaning is heavily reliant on context. Others may be happy to ignore other sets of the most frequent words if they’re already confident with them. Excluding the most frequent words in any particular text could also work, but the most frequent e.g. 1000 words in that single text might actually span the 2-5000 most frequent words of the written language in general. So auto-excluding the n# most frequent words of one book might delete quite a few unknown words.
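A rough Python sketch of the exclusion and export steps, under a few assumptions: the counts are already in a collections.Counter, the "known" set comes from a general frequency list of the language rather than from the book itself (for the reason given above), and the output is a tab-separated text file, which is one of the formats Anki can import. The function names and the translate callback are hypothetical; the callback is a placeholder for deepl or any other translator.

```python
import csv
from collections import Counter


def unknown_words(freqs, known, min_count=1):
    """Return (word, count) pairs not in `known`, most frequent first."""
    rows = [(w, c) for w, c in freqs.items() if w not in known and c >= min_count]
    # Sort by descending count, then alphabetically to keep ties stable.
    return sorted(rows, key=lambda wc: (-wc[1], wc[0]))


def write_anki_tsv(rows, path, translate=None):
    """Write one note per line: word <TAB> translation <TAB> count.

    `translate` is a hypothetical callable word -> translation; when it is
    missing or returns nothing, the translation field is left empty.
    """
    with open(path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for word, count in rows:
            back = (translate(word) or "") if translate else ""
            writer.writerow([word, back, count])


# Toy data: "de" stands in for an external list of very frequent words.
freqs = Counter({"de": 50, "coronel": 7, "gallo": 7, "almendro": 1})
rows = unknown_words(freqs, known={"de"})
print(rows)
```

Raising min_count drops words that appear only once or twice in the book, which may not be worth a card.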

It looks like I’ll be able to do 1 and 2 with calibre, and it looks like 3 might be done easily by saving to .xml and opening the file in Google’s spreadsheet software. And I never really got on with Anki, but I know it’s popular.

Thanks again for the help.

@holdengreen@lemmygrad.ml · 2 years ago

Yes I should be able to do all of that when I’m not out busy. I write on my Manjaro laptop, but the next step is putting the project in a Docker/OCI container which shouldn’t be that hard to run on Windows or whatever.

A mobile app is actually a bit difficult because I’d need to somehow package the entire Python env with interpreter and libraries. Trying to package ML models would be difficult too (requiring multiple GB of storage) unless cloud hosting was an option, which it probably isn’t.

But that’s what I want to do, because a mobile app is best for on-the-go use and is what people are actually likely and able to use. Alternatively, the lessons and materials can be statically generated from the container, which is the initial plan, but that limits rich interactivity, which may be a good goal.

@nour@lemmygrad.ml · 2 years ago

September’s Korean study thread on c/korea

I just created the October thread. :) https://lemmygrad.ml/post/395711

@afellowkid@lemmygrad.ml · 2 years ago

Great!