
For reference, this is from the same developer [1] who created Semantic MediaWiki [2] and led the development of Wikidata [3]. Here's a link to the white paper [4] describing Abstract Wikipedia (and Wikilambda). Considering the success of Wikidata, I'm hopeful this effort succeeds, but it is pretty ambitious.

[1] https://meta.wikimedia.org/wiki/User:Denny

[2] https://en.wikipedia.org/wiki/Semantic_MediaWiki

[3] https://en.wikipedia.org/wiki/Wikidata

[4] https://arxiv.org/abs/2004.04733




Damn. Big kudos to Denny.

And to all the other people doing awesome work but not at the top of HN.


Considering the close relationship between Google and Wikimedia (https://en.wikipedia.org/wiki/Google_and_Wikipedia) and the considerable money Google gives them, how can one not see this project as "crowdsourcing better training datasets for Google"?

Can the data be licensed as GPL-3 or similar?


That's an incredibly zero-sum way of looking at the world.

Almost every research group and company doing NLP work uses Wikipedia. I'd say it is a fantastic donation by Google, one which improves science generally.

> Can the data be licensed as GPL-3 or similar?

It's under CC BY-SA and (with a few exceptions) the GNU Free Documentation License.


I don't think the relationship is that close. All that page says is that Google donated a chunk of money in 2010 and in 2019. It was a large chunk (~3% of donations), but not so large as to create a dependency.

> Can the data be licensed as GPL-3 or similar?

Pretty unlikely, tbh. I don't know if anything has been decided about licensing, but if it is to be a "copyleft" license, it would be CC BY-SA (like Wikipedia), since this is not a program.

Keep in mind that in the United States, an abstract list of facts cannot be copyrighted, AFAIK (I don't think this qualifies as that; Wikidata might, though).


How so? Wikimedia-provided data can be used by anyone. Google could have kept using and building on their Freebase dataset had they wanted to - other actors in the industry don't have it nearly as easy.


Denny seems to be leaving Google and joining the Wikimedia Foundation to lead the project this month, so you probably don't need to worry too much about Denny's affiliation with Google.


As a long-time Wikipedian, I find this track record actually worrisome.

Semantic MediaWiki (which I attempted to use at one point) is difficult to work with and far too complicated and abstract for the average wiki editor. (See also Tim Berners-Lee and the failure of the Semantic Web.)

Wikidata is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.
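For readers who haven't touched the stack being criticized, this is roughly what querying Wikidata over SPARQL looks like from Python today. A minimal sketch, assuming the public Wikidata Query Service endpoint; the query is the well-known "house cats" example, and the requests wrapper is just for illustration, not an official client:

  import requests

  ENDPOINT = "https://query.wikidata.org/sparql"
  QUERY = """
  SELECT ?item ?itemLabel WHERE {
    ?item wdt:P31 wd:Q146 .   # instance of: house cat
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
  LIMIT 10
  """

  # Wikimedia asks clients to send a descriptive User-Agent.
  resp = requests.get(ENDPOINT,
                      params={"query": QUERY, "format": "json"},
                      headers={"User-Agent": "hn-example-script/0.1"})
  for row in resp.json()["results"]["bindings"]:
      print(row["item"]["value"], row["itemLabel"]["value"])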


> Wikidata is a seemingly genius concept -- turn all those boxes of data into a queryable database! -- kneecapped by academic but impractical technology choices (RDF/SPARQL). If they had just dumped the data into a relational database queryable by SQL, it would be far more accessible to developers and data scientists.

Note that the internal data format used by Wikidata is _not_ RDF triples [0], and it's also highly non-relational, since every statement can be annotated with a set of property-value pairs; the full data set is available as a JSON dump. The RDF export [1] (there are actually two; I'm referring to the full dump here) maps this to RDF by reifying statements as RDF nodes. If you wanted to end up with something queryable by SQL, you would also need to resort to reification, but then SPARQL is still the better choice of query language, since it lets you easily do path queries, whereas WITH RECURSIVE at the very least makes your SQL queries quite clunky (see the sketch below the links).

[0] https://www.mediawiki.org/wiki/Wikibase/DataModel

[1] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Fo...
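To make the path-query comparison concrete, here's a rough sketch. The SPARQL property path is standard syntax; the relational half assumes a hypothetical subclass_of(child, parent) edge table, with SQLite standing in for the relational database and one made-up QID seeded for the demo:

  import sqlite3

  # SPARQL: everything transitively a subclass of Q41176 ("building"),
  # expressed with a property path -- a one-liner.
  SPARQL = "SELECT ?x WHERE { ?x wdt:P279* wd:Q41176 . }"

  # The same traversal over a hypothetical subclass_of(child, parent) table
  # needs a recursive common table expression.
  SQL = """
  WITH RECURSIVE sub(id) AS (
      VALUES ('Q41176')
      UNION
      SELECT child FROM subclass_of JOIN sub ON subclass_of.parent = sub.id
  )
  SELECT id FROM sub;
  """

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE subclass_of (child TEXT, parent TEXT)")
  con.executemany("INSERT INTO subclass_of VALUES (?, ?)",
                  [("Q3947", "Q41176"),           # house -> building
                   ("Q_toy_bungalow", "Q3947")])  # made-up child for the demo
  print(con.execute(SQL).fetchall())

The UNION in the CTE is also what keeps the recursion from looping if the hierarchy ever contains a cycle, something the property path handles for you.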


The SPARQL API is no fun. The 60-second query timeout, for example, is a killer. I had to resort to downloading the full dump.
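For anyone heading down the same path: the full JSON dump is one huge array, but in the standard dump layout each entity sits on its own line, so it can be streamed without loading everything into memory. A rough sketch; the file name assumes the usual latest-all.json.gz from dumps.wikimedia.org:

  import gzip
  import json

  # Stream entities out of the full Wikidata JSON dump one line at a time,
  # skipping the array brackets and trailing commas.
  def iter_entities(path):
      with gzip.open(path, "rt", encoding="utf-8") as f:
          for line in f:
              line = line.strip().rstrip(",")
              if not line or line in ("[", "]"):
                  continue
              yield json.loads(line)

  for entity in iter_entities("latest-all.json.gz"):
      label = entity.get("labels", {}).get("en", {}).get("value")
      if label:
          print(entity["id"], label)
          break  # just demonstrate the first entity with an English label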


How do you dump general purpose, encyclopedic data into a relational database? What database schema would you use? The whole point of "triples" as a data format is that they're extremely general and extensible.


Most structured data in Wikipedia articles is in either infoboxes or tables, which can easily be represented as tabular data.

  Table country:

  Name,Capital,Population
  Aland,Foo,100
  Bland,Bar,200
Now you need a graph for representing connections between pages, but as long as the formats are consistent (as they are in templates/infoboxes), that can be done with foreign keys (see the SQL sketch after the tables).

  Table capital
  ID,Name
  123,Foo
  456,Bar

  Table country
  Name,Capital_id,Population
  Aland,123,100
  Bland,456,200
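For what it's worth, here's a minimal SQL version of that toy schema with the foreign key spelled out (SQLite here; the table and column names are just the ones from the example above):

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.executescript("""
  CREATE TABLE capital (id INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE country (
      name TEXT,
      capital_id INTEGER REFERENCES capital(id),
      population INTEGER
  );
  INSERT INTO capital VALUES (123, 'Foo'), (456, 'Bar');
  INSERT INTO country VALUES ('Aland', 123, 100), ('Bland', 456, 200);
  """)

  # Re-join the pieces to get back the flat "country with capital" view.
  for row in con.execute("""
      SELECT country.name, capital.name AS capital, population
      FROM country JOIN capital ON country.capital_id = capital.id"""):
      print(row)  # ('Aland', 'Foo', 100), ('Bland', 'Bar', 200)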


> Most structured data in Wikipedia articles is in either infoboxes or tables

Most of the data in Wikidata does not end up in an infobox or a table on any Wikipedia, however, and graph-like data such as family trees works quite poorly in a relational database, even if you don't consider qualifiers at all.


Those infoboxes get edited all the time to add new data, change data formats, etc. With a relational db, every single such edit would be a schema change. And you would have to somehow keep old schemas around for the wiki history. A triple-based format is a lot more general than that.
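To make that concrete, here's a toy sketch of the difference (table and column names invented for illustration): in a generic statement table a brand-new kind of fact is just another row, while the fixed per-infobox table needs a schema migration.

  import sqlite3

  con = sqlite3.connect(":memory:")

  # Generic triple-style storage: a new kind of fact is just another row.
  con.execute("CREATE TABLE fact (subject TEXT, predicate TEXT, object TEXT)")
  con.execute("INSERT INTO fact VALUES ('Aland', 'population', '100')")
  con.execute("INSERT INTO fact VALUES ('Aland', 'motto', 'Foo forever')")  # no schema change

  # Fixed per-infobox storage: the same edit is a schema migration, and old
  # revisions of the wiki would still need the old schema kept around.
  con.execute("CREATE TABLE country (name TEXT, population INTEGER)")
  con.execute("INSERT INTO country VALUES ('Aland', 100)")
  con.execute("ALTER TABLE country ADD COLUMN motto TEXT")
  con.execute("UPDATE country SET motto = 'Foo forever' WHERE name = 'Aland'")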


RDF shouldn't be lumped in with SPARQL.
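For example, RDF can be consumed without any SPARQL at all. A minimal sketch using the third-party rdflib package (the Turtle snippet is made up for illustration):

  from rdflib import Graph  # third-party: pip install rdflib

  TURTLE = """
  @prefix ex: <http://example.org/> .
  ex:Aland ex:capital ex:Foo ;
           ex:population 100 .
  """

  g = Graph()
  g.parse(data=TURTLE, format="turtle")

  # Walk the triples directly -- no SPARQL involved.
  for subject, predicate, obj in g:
      print(subject, predicate, obj)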


That's the same set of technologies. SPARQL is used to query RDF graphs; that's pretty tightly coupled.



