For reference, this is from the same developer [1] who created Semantic MediaWiki [2] and led the development of Wikidata [3]. Here's a link to the white paper [4] describing Abstract Wikipedia (and Wikilambda). Considering the success of Wikidata, I'm hopeful this effort succeeds, but it is pretty ambitious.
Considering the close relationship between Google and Wikimedia https://en.wikipedia.org/wiki/Google_and_Wikipedia and the considerable money Google gives them, how can one not see this project as "crowdsourcing better training datasets for Google?"
I don't think the relationship is that close. All it says is that Google donated a chunk of money in 2010 and again in 2019; the 2019 donation was large (~3% of donations), but not so much as to create a dependency.
> Can the data be licensed as GPL-3 or similar?
Pretty unlikely, tbh. I don't know if anything has been decided on licensing, but if it is to be a "copyleft" license it would be CC BY-SA (like Wikipedia), since this is not a program.
Keep in mind that in the United States, an abstract list of facts cannot be copyrighted, AFAIK (I don't think this qualifies as that; Wikidata might, though).
How so? Wikimedia-provided data can be used by anyone. Google could have kept using and building on their Freebase dataset had they wanted to - other actors in the industry don't have it nearly as easy.
Denny seems to be leaving Google and joining the Wikimedia Foundation to lead the project this month, so you probably don't need to worry too much about his affiliation with Google.
As a long-time Wikipedian, this track record is actually worrisome.
Semantic MediaWiki (which I attempted to use at one point) is difficult to work with and far too complicated and abstract for the average wiki editor. (See also Tim Berners-Lee and the failure of the Semantic Web.)
Wikidata is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.
> Wikidata is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.
Note that the internal data format used by Wikidata is _not_ RDF triples [0], and it's also highly non-relational, since every statement can be annotated by a set of property-value pairs; the full data set is available as a JSON dump. The RDF export (there are actually two; I'm referring to the full dump here) maps this to RDF by reifying statements as RDF nodes. If you wanted to end up with something queryable by SQL, you would also need to resort to reification - but then SPARQL is still the better choice of query language, since it lets you easily do path queries, whereas WITH RECURSIVE at the very least makes your SQL queries quite clunky.
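To illustrate (a toy sketch, not Wikidata's actual relational layout): in SPARQL, a property path like ?root wdt:P40+ ?descendant (P40 is Wikidata's "child" property) walks an entire family tree in one triple pattern. The SQL equivalent, assuming a hypothetical parent_of(parent_id, child_id) table, needs a recursive CTE:

    -- All descendants of entity 123, given a hypothetical
    -- table parent_of(parent_id, child_id):
    WITH RECURSIVE descendants(id) AS (
        SELECT child_id FROM parent_of WHERE parent_id = 123
      UNION
        SELECT p.child_id
        FROM parent_of AS p
        JOIN descendants AS d ON p.parent_id = d.id
    )
    SELECT id FROM descendants;

The SPARQL version is a single pattern with the + path operator; the SQL version drags the CTE boilerplate along every time.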
How do you dump general-purpose, encyclopedic data into a relational database? What database schema would you use? The whole point of "triples" as a data format is that they're extremely general and extensible.
Now you do need a graph for representing connections between pages, but as long as the format is consistent (as it is in templates/infoboxes), that can be done with foreign keys:
    Table capital
    ID    Name
    123   Foo
    456   Bar

    Table country
    Name   Capital_id  Population
    Aland  123         100
    Bland  456         200
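Spelled out as DDL (a minimal sketch using the toy names above; the foreign key is what encodes the link between the two pages):

    CREATE TABLE capital (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );

    CREATE TABLE country (
        name       TEXT PRIMARY KEY,
        capital_id INTEGER REFERENCES capital(id),  -- link to the capital's page
        population INTEGER
    );

    INSERT INTO capital VALUES (123, 'Foo'), (456, 'Bar');
    INSERT INTO country VALUES ('Aland', 123, 100), ('Bland', 456, 200);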
> Most structured data in Wikipedia articles is in either infoboxes or tables
Most of the data in Wikidata does not end up in an infobox or a table on any Wikipedia, however, and graph-like data such as family trees works quite poorly in a relational database, even if you don't consider qualifiers at all.
Those infoboxes get edited all the time to add new data, change data formats, etc. With a relational db, every single such edit would be a schema change. And you would have to somehow keep old schemas around for the wiki history. A triple-based format is a lot more general than that.
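To make that concrete (a toy sketch; the statement table and the anthem column are made up for illustration): in a generic subject-predicate-object layout, a new kind of fact is just another row, while the relational schema needs a migration:

    -- Triple-style layout: a new kind of fact is an INSERT, not DDL.
    CREATE TABLE statement (
        subject   TEXT,
        predicate TEXT,
        object    TEXT
    );
    INSERT INTO statement VALUES ('Aland', 'population', '100');
    INSERT INTO statement VALUES ('Aland', 'anthem', 'Song of Aland');  -- new "field"

    -- Relational layout: the same addition is a schema change, and the
    -- old schema is gone unless you version it somehow.
    ALTER TABLE country ADD COLUMN anthem TEXT;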
[1] https://meta.wikimedia.org/wiki/User:Denny
[2] https://en.wikipedia.org/wiki/Semantic_MediaWiki
[3] https://en.wikipedia.org/wiki/Wikidata
[4] https://arxiv.org/abs/2004.04733