General search fix, status pattern match regex, find feeds on startup
36
README.md
@@ -4,7 +4,7 @@
- Fetch parsing URL host
- Fetch from RSS feed
- Fetch keyword search (Google search & news, DuckDuckGo, ...)
++ Sources -> Robustness to TooManyRequests block
- TODO: More sources -> Robustness to TooManyRequests block
- Selenium based
- Sites change their logic, request captcha, ...
- Brave Search API
@@ -12,22 +12,32 @@
- Bing API
- Subscription required
- Yandex. No API?
++ Proxy / VPN?
TooManyRequests, ...
++ Search per locale (nl-NL, fr-FR, en-GB)
- TODO: Proxy / VPN?
- TooManyRequests, ...
- TODO: Search per locale (nl-NL, fr-FR, en-GB)

- URLs Processing -> Updates raw URLs
- Extracts title, description, content, image and video URLs, main image URL, language, keywords, authors, tags, published date
- Determines if it is a valid article content
++ Proxy / VPN?
Bypass geoblock
- TODO: Proxy / VPN?
- Bypass geoblock and TooManyRequests

- Visualization of URLs
- Filter URLs
- By status, search, source, language, ...
- By fetch date, status, search, source, language, has valid content, minimum amount of sources, ...
- Charts

- Valid URLs
- URLs selection
- Published (or fetch) date during last_week / last 24 hrs
- Language of interest
- Valid content
- Fetched by at least N sources
- Use classifications and summaries
- TODO: Manual inspection -> Improve automation
- Rules or pattern for invalid articles, e.g. "youtube.com/*"
- URL host with "priority" or "weight"

- Content generation
- Generate summary
- One paragraph
- At most three paragraphs
@@ -35,12 +45,4 @@
- 5W: Who, What, When, Where, Why of a Story
- Related to child abuse?
- ...

- Content generation
- URLs Selection
- Valid content
- Language of interest
- Published (or fetch) date during last_week
- Fetched by at least N sources
- Use classifications and summaries
- Merge summaries, ...
- Merge similar articles?
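
The URL selection criteria listed in the README (valid content, language of interest, recent fetch date, fetched by at least N sources) and the "youtube.com/*" style rule for invalid articles could be combined roughly as follows. This is only a minimal sketch: the `UrlRecord` fields, the `select_urls` and `matches_invalid_pattern` helpers, and the default thresholds are all hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase
from urllib.parse import urlparse


@dataclass
class UrlRecord:
    """Hypothetical record; field names are assumptions, not the project's schema."""
    url: str
    language: str
    valid_content: bool
    fetched_at: datetime  # timezone-aware fetch (or published) date
    source_count: int     # number of sources that returned this URL


# Example rule from the README; a real deployment would load its own list.
INVALID_PATTERNS = ["youtube.com/*"]


def matches_invalid_pattern(url, patterns=INVALID_PATTERNS):
    """Return True if host + path matches any shell-style pattern."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    target = host + (parsed.path or "/")
    return any(fnmatchcase(target, pattern) for pattern in patterns)


def select_urls(records, languages, min_sources=2,
                max_age=timedelta(days=7), now=None):
    """Keep records that satisfy every selection criterion."""
    now = now or datetime.now(timezone.utc)
    return [
        r for r in records
        if r.valid_content
        and r.language in languages
        and now - r.fetched_at <= max_age
        and r.source_count >= min_sources
        and not matches_invalid_pattern(r.url)
    ]
```

Passing `max_age=timedelta(hours=24)` would correspond to the "last 24 hrs" window, while the default of seven days corresponds to "last_week".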