Valid content filter, language detect on min chars, fetch missingkids.org

Luciano Gervasoni
2025-04-03 09:44:46 +02:00
parent 3b54e247e7
commit 5addfa5ba9
18 changed files with 533 additions and 66 deletions


@@ -1 +1,33 @@
# Matitos
- Scheduled tasks
    - Fetcher -> Inserts raw URLs
        - Fetch by parsing the URL host
        - Fetch from RSS feeds
        - Fetch by searching (Google Search & News, DuckDuckGo, ...)
    - Process URLs -> Updates raw URLs
        - Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
        - Determines whether the content is a valid article
    - Valid URLs
        - Generate summary
        - Classification
            - 5W: Who, What, When, Where, Why of a Story
                - Source: Georgia Institute of Technology, https://comm.gatech.edu (resources, writers)
            - Related to child abuse?
            - ...
- Visualization of URLs
    - Filter URLs
        - By status, search, source, language
    - Charts
- Content generation
    - Select URLs:
        - Valid content
        - language=en
        - published_date during last_week
    - Use classifications
    - Merge summaries, ...
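
The URL selection step for content generation could be sketched roughly as follows. This is only an illustration, not the project's actual code: the `UrlRecord` shape, its field names, the `"valid"` status string, and the helpers `select_urls` and `merge_summaries` are all hypothetical, and `last_week` is assumed to mean the trailing seven days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class UrlRecord:
    # Hypothetical record shape; the field names are assumptions.
    url: str
    status: str                          # assumed values: "raw", "valid", "invalid"
    language: Optional[str]              # language code, None if not detected
    published_date: Optional[datetime]   # timezone-aware, None if unknown
    summary: str = ""


def select_urls(
    records: Iterable[UrlRecord],
    now: datetime,
    language: str = "en",
    window: timedelta = timedelta(days=7),
) -> List[UrlRecord]:
    """Keep valid records in the given language published within the window."""
    cutoff = now - window
    return [
        r
        for r in records
        if r.status == "valid"
        and r.language == language
        and r.published_date is not None
        and cutoff <= r.published_date <= now
    ]


def merge_summaries(records: Iterable[UrlRecord]) -> str:
    """Join the non-empty summaries, one paragraph per URL."""
    return "\n\n".join(r.summary for r in records if r.summary)


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2025, 4, 3, tzinfo=utc)
    records = [
        UrlRecord("https://example.org/a", "valid", "en", datetime(2025, 4, 1, tzinfo=utc), "Summary A"),
        UrlRecord("https://example.org/b", "valid", "es", datetime(2025, 4, 1, tzinfo=utc), "Summary B"),
        UrlRecord("https://example.org/c", "valid", "en", datetime(2025, 3, 20, tzinfo=utc), "Summary C"),
        UrlRecord("https://example.org/d", "invalid", "en", datetime(2025, 4, 2, tzinfo=utc), "Summary D"),
    ]
    selected = select_urls(records, now)
    print([r.url for r in selected])  # ['https://example.org/a']
    print(merge_summaries(selected))  # Summary A
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the selection reproducible for a scheduled task that may be re-run later.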