# Matitos

- URLs Fetcher -> Inserts raw URLs
  - Fetch parsing URL host
  - Fetch from RSS feed
  - Fetch keyword search (Google search & news, DuckDuckGo, ...)
  - TODO: More sources -> Robustness to TooManyRequests block
    - Selenium based
      - Sites change their logic, request captcha, ...
    - Brave Search API
      - Free up to X requests per day. Requires an associated credit card (no charges)
    - Bing API
      - Subscription required
    - Yandex. No API?
  - TODO: Proxy / VPN?
    - TooManyRequests, ...
  - TODO: Search per locale (nl-NL, fr-FR, en-GB)
  - Fetch keyword search for selenium sources
- URLs Processing -> Updates raw URLs
  - Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
  - Determines whether it is valid article content
  - TODO: Proxy / VPN?
    - Bypass geoblock and TooManyRequests
- Visualization of URLs
  - Filter URLs
    - By fetch date, status, search, source, language, has valid content, minimum number of sources, ...
  - Charts
- URLs selection
  - Published (or fetch) date within the last week / last 24 hours
  - Language of interest
  - Valid content
  - Fetched by at least N sources
  - Use classifications and summaries
  - TODO: Manual inspection -> Improve automation
    - Rules or patterns for invalid articles, e.g. "youtube.com/*"
    - URL host with "priority" or "weight"
- Content generation
  - Generate summary
    - One paragraph
    - At most three paragraphs
  - Classification
    - 5W: Who, What, When, Where, Why of a story
    - Related to child abuse?
    - ...
  - Merge similar articles?

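
The "URLs selection" criteria listed above could be combined roughly as in the following sketch. It is illustrative only: the `UrlRecord` fields, the `INVALID_PATTERNS` and `HOST_WEIGHTS` values, and the function names are all assumptions made for this example, not the project's actual schema or code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase
from urllib.parse import urlparse


@dataclass
class UrlRecord:
    # Hypothetical record shape; the real schema is not described here.
    url: str
    published: datetime
    language: str
    valid_content: bool
    sources: frozenset  # names of the fetchers that found this URL


# Illustrative values only (assumptions for this sketch).
INVALID_PATTERNS = ["youtube.com/*"]
HOST_WEIGHTS = {"example.org": 2.0}


def host_of(url):
    """Lower-cased host without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_excluded(url):
    """True if host + path matches any invalid-article pattern."""
    target = host_of(url) + (urlparse(url).path or "/")
    return any(fnmatchcase(target, pattern) for pattern in INVALID_PATTERNS)


def select_urls(records, now, window=timedelta(days=7),
                languages=("en",), min_sources=2):
    """Apply the selection criteria, then order by host weight (highest first)."""
    selected = [
        r for r in records
        if now - r.published <= window       # published within the window
        and r.language in languages          # language of interest
        and r.valid_content                  # valid article content
        and len(r.sources) >= min_sources    # fetched by at least N sources
        and not is_excluded(r.url)           # no invalid-article pattern
    ]
    return sorted(selected,
                  key=lambda r: HOST_WEIGHTS.get(host_of(r.url), 1.0),
                  reverse=True)
```

Calling `select_urls(records, now=datetime.now(timezone.utc), window=timedelta(hours=24))` would cover the "last 24 hours" variant. Glob-style patterns are used here only because the example pattern "youtube.com/*" reads like one; regular expressions would work just as well.
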
# Deploy

* Dev mode
```
docker compose -f docker-compose-dev.yml down -v
docker compose -f docker-compose-dev.yml up --no-deps --build
```
* Prod mode
```
docker compose down -v
docker compose up -d --no-deps --build
```