Min num of sources filter, initialization scripts, docker ready to use dev mode

Luciano Gervasoni
2025-04-04 16:56:27 +02:00
parent 76079d7bd0
commit 9127552bfd
10 changed files with 132 additions and 83 deletions

@@ -1,44 +1,46 @@
# Matitos
- Scheduled tasks
- Fetcher -> Inserts raw URLs
- Fetch parsing URL host
- Fetch from RSS feed
- Fetch keyword search (Google search & news, DuckDuckGo, ...)
++ Sources -> Robustness to TooManyRequests block
- Selenium based
- Sites change their logic, request captcha, ...
- Brave Search API
- Free up to X requests per day. Need credit card association (no charges)
- Bing API
- Subscription required
- Yandex. No API?
++ Proxy / VPN?
TooManyRequests, ...
++ Search per locale (nl-NL, fr-FR, en-GB)
- Process URLs -> Updates raw URLs
- Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
- Determines if it is a valid article content
- URLs Fetcher -> Inserts raw URLs
- Fetch parsing URL host
- Fetch from RSS feed
- Fetch keyword search (Google search & news, DuckDuckGo, ...)
++ Sources -> Robustness to TooManyRequests block
- Selenium based
- Sites change their logic, request captcha, ...
- Brave Search API
- Free up to X requests per day. Need credit card association (no charges)
- Bing API
- Subscription required
- Yandex. No API?
++ Proxy / VPN?
Bypass geoblock
- Valid URLs
- Generate summary
- One paragraph
- At most three paragraphs
- Classification
- 5W: Who, What, When, Where, Why of a Story
- Related to child abuse?
- ...
TooManyRequests, ...
++ Search per locale (nl-NL, fr-FR, en-GB)
- URLs Processing -> Updates raw URLs
- Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
- Determines if it is a valid article content
++ Proxy / VPN?
Bypass geoblock
- Visualization of URLs
- Filter URLs
- By status, search, source, language
- By status, search, source, language, ...
- Charts
- Valid URLs
- Generate summary
- One paragraph
- At most three paragraphs
- Classification
- 5W: Who, What, When, Where, Why of a Story
- Related to child abuse?
- ...
- Content generation
- Select URLs:
- URLs Selection
- Valid content
- language=en
- published_date during last_week
- Use classifications
- Language of interest
- Published (or fetch) date during last_week
- Fetched by at least N sources
- Use classifications and summaries
- Merge summaries, ...
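
The "URLs Selection" criteria listed above (valid content, language of interest, published or fetch date during the last week, fetched by at least N sources) could be combined roughly as in the sketch below. This is only an illustration, not the project's implementation: `UrlRecord`, its fields, the `"valid"` status string and `select_urls` are hypothetical names, and falling back to the fetch date when no published date exists is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class UrlRecord:
    """Hypothetical in-memory view of one stored URL (not the real schema)."""
    url: str
    status: str                          # assumed values: "raw", "valid", "invalid", ...
    language: Optional[str]              # e.g. "en"; None if not detected
    published_date: Optional[datetime]   # None if the page exposes no date
    fetch_date: datetime
    sources: frozenset                   # names of the sources that fetched this URL


def select_urls(
    records: Iterable[UrlRecord],
    languages: set,
    min_sources: int,
    now: Optional[datetime] = None,
    window: timedelta = timedelta(days=7),
) -> list:
    """Keep records that pass all four selection criteria."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    selected = []
    for rec in records:
        if rec.status != "valid":            # valid content only
            continue
        if rec.language not in languages:    # language of interest
            continue
        # Published date if known, otherwise the fetch date (assumption).
        reference = rec.published_date or rec.fetch_date
        if not (cutoff <= reference <= now):
            continue
        if len(rec.sources) < min_sources:   # fetched by at least N sources
            continue
        selected.append(rec)
    return selected
```

With `min_sources=2`, a URL seen only in an RSS feed would be dropped, while one returned by both an RSS feed and a keyword search would be kept, provided it also passes the status, language and date checks.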