
Data Discussions is a series of interviews with leading data management experts and practitioners,
presented by Wilshire Conferences. Click here to sign up to receive future editions.
FORWARDING THIS NEWSLETTER TO YOUR COLLEAGUES IS ENCOURAGED.

April 12, 2004

Meta Data meets Common Sense
An Interview with Doug Lenat 

People understand many things that computers don't.  We understand basic, obvious stuff, such as that a "customer" who has just called in an order is a human being, with fundamental characteristics such as being alive, having a birth date, and living at a physical location somewhere.  If the customer is a 20-year-old male, we can also make some reasonable estimates of other characteristics, such as that he has a mother, who is older than him, but that he cannot be a mother.  We know these things intuitively - common sense tells us these traits are obvious.  Computers, on the other hand, cannot reach these same conclusions very easily.  Until now...

Cyc (pronounced "psych") is a very long-term, high-risk technological gamble that has begun to pay off.  Cyc is essentially a knowledge base of common sense.  Begun as a research project in 1984, 20 years and $60 million later Cyc is a working technology with applications to many real-world business problems.  Companies use Cyc to unify disparate databases, prevent data quality problems, and warn when computer networks have vulnerabilities hackers can exploit.  Cyc already helps search engines produce more relevant results, and is being tested as an intelligence tool in the war against terrorism.  

Dr. Douglas Lenat is the founder and intellectual force behind Cycorp.  He sees the Cyc knowledge base as a sort of "super metadata" resource...an engine for software and databases that will add semantic understanding to a set of queries, integration challenges and other  application domains that have been totally intractable up till now.  He will speak at the upcoming DAMA International Symposium and Wilshire Meta-Data Conference, on May 2-6, in Los Angeles.  I was privileged to spend some time with him recently to discuss his work at Cyc, the implications of semantic intelligence within computers, and a grand future for meta data.

Tony Shaw, Wilshire Conferences (Wilshire): Doug, for the benefit of those who are not familiar with Cyc, can you please give us a high level description of what it is? Do you have a simple layman's explanation? 

Doug Lenat
President and CEO, Cycorp

Dr. Douglas Lenat is one of the world's leading computer scientists; a pioneer in efforts to apply large amounts of encoded knowledge to information management tasks. As head of Cycorp, an Austin-based corporation, Dr. Lenat leads groundbreaking research in an array of computer software technologies, including the formalization of common sense, a multi-contextual knowledge base, and an efficient inference engine. Doug was a principal scientist at Microelectronics and Computer Technology Corporation, where he led the CYC project. He is a former professor of computer science at Carnegie Mellon University and Stanford University, and the author of hundreds of papers, articles and books.

Doug Lenat (Lenat): Cyc is a knowledge base which contains millions of facts and rules of thumb that the average person knows and believes. We write them down in a formal language, so that given a set of facts or a query, a computer (or actually an inference engine) is able to automatically reach the same sorts of conclusions that a person would. For 50 years, researchers have tried to build an Artificial Intelligence using tactics like simulating evolution, simulating an infant, extracting facts from English texts, and so forth. In every case, a veneer of intelligence was achieved, but the software programs had no common sense. In 1984, I decided to do something about it, and for 20 years my team and I have devoted a person-century of effort to prime the knowledge pump - to seed the knowledge base with all those facts, so that it now has that "common sense."  This system, Cyc, may be the missing piece that enables all the other AI research efforts going on in the world to finally succeed.  

Wilshire: I understand that Cyc accumulates common sense in layers, in much the same way as people do, into a larger set of assumptions and understanding about the world.  Is that correct? Would you call this perhaps machine understanding? 

Lenat: That is exactly it, Tony. The computer ought to be able to add in common sense facts, rules, and constraints, just as a human being would. For instance, a flight that takes off from LAX at 5 p.m. and lands at JFK at 1 a.m. is landing at that time on the following day; it is likely to land about that time, to the nearest hour; and so on.  People understand these things quite intuitively, because they understand how the real world works. But up to now, unless they were specifically told, computers did not. 
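To make the flight example concrete, here is a minimal Python sketch of that one piece of common sense: a landing time given only as a clock hour is placed on whichever date yields a flight lasting more than zero and less than 24 hours. The function name, the fixed UTC offsets, and the departure date are assumptions made for illustration; this is not how Cyc represents or applies the rule.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets are an assumption for brevity (no daylight saving handling).
LAX = timezone(timedelta(hours=-8))
JFK = timezone(timedelta(hours=-5))


def infer_arrival(departure, arrival_hour, arrival_tz):
    """Pick the arrival date that makes the flight last more than 0 and less than 24 hours."""
    first_day = departure.astimezone(arrival_tz).date()
    # Seen from the arrival time zone, the landing falls on the departure date or the day after.
    for day_offset in (0, 1):
        day = first_day + timedelta(days=day_offset)
        candidate = datetime(day.year, day.month, day.day, arrival_hour, tzinfo=arrival_tz)
        duration = candidate - departure
        if timedelta(0) < duration < timedelta(hours=24):
            return candidate, duration
    raise ValueError("no arrival date gives a plausible flight duration")


departure = datetime(2004, 5, 3, 17, 0, tzinfo=LAX)   # 5 p.m. at LAX, arbitrary date
arrival, duration = infer_arrival(departure, 1, JFK)  # lands at 1 a.m. at JFK
print(arrival.date(), duration)
```

Under these assumptions the sketch places the landing on the day after departure, with a five-hour flight.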

 

Wilshire: What sort of potential does the technology have for the fields of data management and software development? Can you give me an example?

Lenat: Suppose I want to know which theaters within 5 miles of UCLA are showing films today that star someone born in Austin, Texas? You know that the information to answer that is available trivially on the Web, but you'd have to go to several sites to answer the sub-questions (e.g., the online Los Angeles theater guide; IMDB.COM (a comprehensive Movie Database); and MapQuest). Then by hand you would combine those answers into an answer to the original question. Why doesn't it happen automatically? Because traditional software engineering approaches to database integration, such as Data Warehousing, are combinatorially explosive in the number of data elements. The basic idea is that we can use Cyc as a sort of "semantic glue" or interlingua, to convert this into a linear task, a nonexplosive task. We explain to Cyc the meaning of each relation, field, column, etc., independent of all the others, and then it just goes off to each source as it needs to (using subgoaling). It's a little more complicated, because you want to bundle up whole sets of queries that databases can answer efficiently in parallel, and you need to explain the physical and logical layers of each source: URLs, passwords, SQL, web page forms, and such. But basically that's it - this enables n different information sources to have their content virtually integrated. I expect this to become the workhorse or foundation for the next generation of data integration applications and data visualization tools. 
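The decomposition into sub-questions can be sketched in a few lines of Python. Everything here is a stand-in: the three in-memory lists play the role of the theater guide, the movie database, and a biography source, all names and coordinates are invented, and the plan that Cyc would derive by subgoaling is simply hard-coded as three function calls.

```python
import math

# Invented data standing in for three independent sources.
THEATER_GUIDE = [
    {"theater": "Theater One", "lat": 34.063, "lon": -118.447, "film": "Film A"},
    {"theater": "Theater Two", "lat": 34.045, "lon": -118.250, "film": "Film A"},
    {"theater": "Theater Three", "lat": 34.062, "lon": -118.446, "film": "Film B"},
]
MOVIE_DB = [
    {"title": "Film A", "star": "Actor One"},
    {"title": "Film B", "star": "Actor Two"},
]
PEOPLE_DB = [
    {"name": "Actor One", "birthplace": "Austin, Texas"},
    {"name": "Actor Two", "birthplace": "Boston, Massachusetts"},
]
UCLA = (34.069, -118.445)  # approximate campus coordinates


def miles_between(a, b):
    """Great-circle distance in miles between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))


def people_born_in(place):
    """Sub-question 1: who was born in the given place?"""
    return {p["name"] for p in PEOPLE_DB if p["birthplace"] == place}


def films_starring(people):
    """Sub-question 2: which films star any of those people?"""
    return {m["title"] for m in MOVIE_DB if m["star"] in people}


def theaters_showing(films, center, radius_miles):
    """Sub-question 3: which theaters near the center show any of those films?"""
    return sorted({t["theater"] for t in THEATER_GUIDE
                   if t["film"] in films
                   and miles_between((t["lat"], t["lon"]), center) <= radius_miles})


answer = theaters_showing(films_starring(people_born_in("Austin, Texas")), UCLA, 5)
print(answer)
```

The point of the sketch is only that each source answers the part it knows about and the results are chained; in the approach described above, that chaining would be worked out from the per-source descriptions rather than written by hand.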

Something we can demonstrate now, but will probably be little more than a demonstration for the next several years, is to have Cyc do other types of automatic program synthesis, drawing on its common sense knowledge to inject the same sorts of constraints that a human programmer would (e.g., for an airline reservation system, knowing that two adults won't be sharing a seat on the same flight, that people will never be on two different flights at the same moment, that each flight segment takes more than zero but less than 24 hours, and so on.) 
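As a rough illustration of the kind of constraints listed here, the following Python sketch checks a handful of invented booking records against the three rules: no two passengers in the same seat on the same flight, no passenger on two overlapping flights, and every segment lasting more than zero but less than 24 hours. The record layout and function name are assumptions, and all passengers are treated as adults; a programmer would normally have to think of these checks unprompted, which is the gap common sense knowledge is meant to fill.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Hypothetical booking records: (passenger, flight, seat, departs, arrives).
bookings = [
    ("Ann", "F100", "12A", datetime(2004, 5, 3, 9), datetime(2004, 5, 3, 12)),
    ("Bob", "F100", "12A", datetime(2004, 5, 3, 9), datetime(2004, 5, 3, 12)),
    ("Ann", "F200", "3C", datetime(2004, 5, 3, 11), datetime(2004, 5, 3, 14)),
    ("Cy", "F300", "7D", datetime(2004, 5, 3, 9), datetime(2004, 5, 4, 10)),
]


def violations(records):
    """Return a human-readable message for every common sense rule that is broken."""
    found = []
    for passenger, flight, seat, departs, arrives in records:
        # Each flight segment takes more than zero but less than 24 hours.
        if not timedelta(0) < arrives - departs < timedelta(hours=24):
            found.append(f"{flight}: segment duration out of range")
    for a, b in combinations(records, 2):
        # Two different passengers cannot share a seat on the same flight.
        if a[0] != b[0] and a[1] == b[1] and a[2] == b[2]:
            found.append(f"{a[1]} seat {a[2]}: shared by {a[0]} and {b[0]}")
        # One passenger cannot be on two different flights at the same moment.
        if a[0] == b[0] and a[1] != b[1] and a[3] < b[4] and b[3] < a[4]:
            found.append(f"{a[0]}: on {a[1]} and {b[1]} at the same time")
    return found


for message in violations(bookings):
    print(message)
```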

Appropriate application areas include any domain for which the salient application knowledge (entity types, entities, relations and rules) can be explicitly stated in a formal representation language, such as first-order logic or Prolog, and for which Cyc's vast store of commonsense knowledge will be able to fill in semantic gaps. For instance, performing the intensive calculations required to compute the dispersal pattern of an aerosol nerve agent would not be a good application of Cyc; it would be better to have a physical plume dispersal simulator and just have Cyc know how to "call" it as a black box. But a good Cyc application would then "reason" with the results provided by that simulator to suggest and test hypotheses about likely targets for terrorism.


Wilshire: Companies are already using Cyc for commercial database integration? Can you provide an example, and explain how it works?

Lenat: The first version of our database integration framework was implemented for a major pharmaceutical company. They had over 100,000 data elements dispersed across scores of idiosyncratic sources: they subscribed to several streams of third party medical data, their different departments used different schemata, and the company was merging with another company with its own set of schemata.

We learned a lot from that project, and our most recent revision of that framework, which we call Semantic Knowledge Source Integration (SKSI), underlies many of our projects for the US government (such as helping analysts generate plausible threat scenarios) and underlies joint ventures with other companies (such as an intelligent CRM manager).  

A complete Cyc language (CycL) description of an external source comprises three layers, or levels of abstraction: the access layer, the physical schema, and the logical schema (see the fuller explanation at the end of this interview). The access layer and the physical schema together constitute what many people still mean by the term "metadata". The logical schema captures the high-level, humanly significant meanings that are usually only implicit in the external data source. This level of meaning is absent from the traditional understanding of metadata, and is now sometimes referred to in the literature as "semantic metadata". 

Finally, the CycL representation of an external source includes "mapping" statements that allow Cyc to translate data values to the form expressed in the logical schema and vice versa. Such mapping statements also make it possible for Cyc's inference engine to automatically translate CycL queries into the query protocol and syntax appropriate for each source, such as SQL or a web page form submission. Once a data source has been described to Cyc, the full power of Cyc's inference engine and enormous knowledge base can be brought to bear. The source is now integrated with all of the other data sources known to Cyc, not to mention the other millions of common sense things that Cyc already knows. 

Integration occurs at the semantic level, the level of the logical schema. This means, for example, that a source containing data about the locations and inventories of retail clothing outlets is automatically a candidate for joining with a source that provides the current ambient temperatures of cities (zip codes), even if, as seems likely, these two sources were developed for completely different purposes and share nothing in common at the physical level. Cyc knows that each of these sources has a field denoting, at the semantic level, a type of place (physical location), and that if a way to translate the Cyc term denoting a place (e.g., CityOfAustinTX) to the physical-level form required for each source can be found, then a join is possible, and in that case here is the expression to make that happen.
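The role of the mapping statements can be sketched in Python as a small lookup structure plus a translator. This is only an analogy under stated assumptions: the dictionary layout, the table and column names, and the `to_sql` function are invented for illustration and are not CycL or any Cycorp API. The idea shown is that a query phrased in logical terms (such as CityOfAustinTX) is rewritten into the column names and stored values of one particular source.

```python
# Hypothetical mapping from logical-schema terms to one source's physical schema.
MAPPING = {
    "table": "LOCATION",
    # Logical term -> physical column name.
    "fields": {"RetailStore": "store_id", "City": "city", "PostalCode": "zip"},
    # Logical value -> the form in which this source stores it.
    "values": {"City": {"CityOfAustinTX": "AUSTIN"}},
}


def to_sql(mapping, wanted, where):
    """Translate a logical query into a parameterized SQL string and its arguments."""
    columns = ", ".join(mapping["fields"][term] for term in wanted)
    clauses, args = [], []
    for term, value in where.items():
        clauses.append(mapping["fields"][term] + " = ?")
        # Fall back to the value itself when no translation is recorded.
        args.append(mapping["values"].get(term, {}).get(value, value))
    sql = ("SELECT " + columns + " FROM " + mapping["table"]
           + " WHERE " + " AND ".join(clauses))
    return sql, args


sql, args = to_sql(MAPPING, ["RetailStore", "PostalCode"], {"City": "CityOfAustinTX"})
print(sql, args)
```

A second source with different column names would need only its own mapping; the logical query itself would stay the same.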

(Eds...Doug provided an extensive explanation of how Cyc actually integrates databases, which is included in the footnote at the end of this interview)


Wilshire: A lot of the people reading this are deeply involved in the field of metadata, and you're predicting a significant role for this technology in the future of metadata. In fact, this is what you'll be talking about at the DAMA Symposium + Wilshire Meta-Data Conference in Los Angeles on May 3. So tell us more about that please...and perhaps you could illustrate with an example or two. 

Lenat: There are really two main points I'll be talking about. The first is that a large, broad common sense knowledge base (ontology + rules about the meanings of the terms of the ontology) can be the heart of a semantic data integration capability. The example I just gave you, about checking ShopMart inventories and ambient weather conditions, shows what I mean by that. 

My second point is that there is a lot of regularity in the types of metadata that are available and useful, and we can factor that out into a dozen different species or categories. This includes the level of granularity at which the data is valid, the time and place at which the data was true, whether this is about the world as it is or as x believes it to be (or as x believes y believes z... believes it to be), and so on. You can think of these as a dozen different dimensions of metadata-space. When asking a query, you set the values of these dozen dials, sort of like Mr. Peabody's wayback machine, if you remember what that is, and then the answers are valid for the times, places, etc. you specified. The broader your settings, the more varied and inconsistent answers you'll get. E.g., if you ask whether Bill Clinton was a good President, the DNC source always said Yes, Al Jazeera always said No, the NY Times changed their minds at various times, and so on. If you narrow in the settings, you would just get one Yes or No answer back.
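The "dials" idea can be illustrated with a toy Python sketch that uses just two of the dimensions, source and year. The assertions, source names, and years are invented placeholders; the only point is that broad settings return conflicting answers while narrow settings return one.

```python
# Invented assertions, each tagged with two metadata dimensions: source and year.
ASSERTIONS = [
    {"claim": "good president", "answer": "Yes", "source": "Source A", "year": 1996},
    {"claim": "good president", "answer": "No", "source": "Source B", "year": 1996},
    {"claim": "good president", "answer": "Yes", "source": "Source C", "year": 1996},
    {"claim": "good president", "answer": "No", "source": "Source C", "year": 1998},
]


def ask(claim, sources=None, years=None):
    """Return the set of answers valid under the given settings; None means 'any'."""
    return {a["answer"] for a in ASSERTIONS
            if a["claim"] == claim
            and (sources is None or a["source"] in sources)
            and (years is None or a["year"] in years)}


broad = ask("good president")                                     # every dial wide open
narrow = ask("good president", sources={"Source C"}, years={1998})  # dials narrowed
print(sorted(broad), sorted(narrow))
```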

Because our approach breaks the warehousing bottleneck, the need for a human being to hand-map sources to each other, the Cyc-based SKSI can proceed in parallel, with each person responsible for an information source independently explaining their source to Cyc. This has increasing returns on demand - sort of like having telephones catch on; if you have one of the few telephones in the world, it's not nearly so useful as if almost everyone has one. Once the snowball gets rolling, and large numbers of sources have this Cyc SKSI metadata, the result will be like discovering a new power source. Practically any software application you can name could be profoundly affected by in effect giving it common sense and having it be virtually integrated with all the other information sources. This in and of itself is not HAL-like AI, it's more like an intelligent information retrieval and question answering system that really works. A century ago, another newly harnessed power source, electricity, transformed our society by making appliances such as washing machines available that changed the average person's life into something much closer to what only the very wealthiest individuals had experienced before, those with servants. In much the same way, I think that Cyc-based SKSI could turn out to be a power source with similar impact, automatically doing the sort of data integration labor which we have to do ourselves or have our servants or employees or graduate students do. Think of the impact that simple Boolean-combination-of-keywords searching has had on all of us; Cyc-based SKSI is the next qualitatively more powerful function being performed.
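A back-of-the-envelope Python sketch shows why describing each source once scales better than hand-mapping sources to each other. It rests on a simplifying assumption that is not stated in the interview: that hand-mapping needs one mapping per pair of sources. Real warehousing effort may grow faster than this, since it depends on the number of data elements as well.

```python
def pairwise_mappings(n_sources):
    """Mappings needed if every pair of sources is mapped to each other by hand."""
    return n_sources * (n_sources - 1) // 2


def hub_descriptions(n_sources):
    """Descriptions needed if each source is explained once to a shared ontology."""
    return n_sources


for n in (5, 20, 100):
    print(n, pairwise_mappings(n), hub_descriptions(n))
```

With 100 sources, the pairwise count under this assumption is 4,950 against 100 independent descriptions, and the descriptions can be written in parallel by different people.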


Wilshire: I've heard you say that metadata is a source of system vulnerability. Can you explain that assertion please? 

Lenat: One force that strongly facilitates security is obscurity or inaccessibility. If someone wants to prize some data out of your brain, they can't do it by cutting open your head, because no one currently understands how to "read" that encoding for the data. But the more explicit and complete the metadata, the more readily an agent (human or computer) could reason with and about that source. For instance, from patterns in the metadata they might be able to inductively guess where and how CNN collects its political poll results, and then tamper with the world so as to bias all subsequent CNN poll results. Of course meta (and meta-meta-) data can be used by an inference engine to help detect and thwart such nefarious inferences, and so on at each level of spy vs. spy. This is like the constant escalation of capabilities in police radar guns and driver radar detectors; each advances to just overtake the last advance of the other. Don't get me wrong, I think the power and advantages outweigh the cost, just like having electrical devices is cost-effective even though it has made the world a more dangerous place.


Wilshire: Let's talk about another application area within data management. Data quality for example...what would Cyc do for an organization in the DQ area? 

Lenat: Integrating an external data source with Cyc effectively means making it possible for Cyc's inference engine to apply to that source an additional set of semantic integrity constraints derived from the huge amount of commonsense knowledge in the Cyc knowledge base. Inconsistencies and errors could be detected and flagged for human action, or deleted. New, correct values could be computed and added. Cyc could apply constraints/rules from many domains. 

Here is a simple example. In one table, X is listed as the probable culprit of a 1995 hijacking, but in another table, X is listed as having died in an explosion in 1993. Cyc knows that people don't do volitional actions (like hijackings) after death, so one or both of those are just wrong.
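That check is simple enough to sketch directly in Python. The table layouts and the function name are assumptions; in Cyc the rule about volitional actions would be general knowledge rather than code written for these two tables.

```python
# Hypothetical rows from two tables that mention the same person.
suspects = [{"person": "X", "event": "hijacking", "year": 1995}]
deaths = [{"person": "X", "year": 1993}]


def actions_after_death(actions, death_records):
    """Flag volitional actions dated after the actor's recorded death."""
    died = {d["person"]: d["year"] for d in death_records}
    return [a for a in actions
            if a["person"] in died and a["year"] > died[a["person"]]]


flagged = actions_after_death(suspects, deaths)
print(flagged)
```

The sketch can only report the conflict; deciding which of the two records is wrong still needs further evidence or human review.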

Or consider a database which has been populated by an automatic text extraction system - something that reads the text and produces tuples to assert to the database. A text report says that Libya is in the U.N., and this gets translated into a tuple representing that the country Libya is physically contained within the United Nations building in New York City. Cyc knows that huge objects are generally not going to fit into vastly smaller containers, and countries are vastly bigger than office buildings, so this tuple must be wrong. If the extraction program does this DQ test incrementally, it might then try its next-best guess at what the meaning of the sentence is, pass that by Cyc for DQ, and so on, until one is found that is not deemed impossible by Cyc. 
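The incremental loop described here can be sketched in Python as "try each candidate reading in order and keep the first one that passes a plausibility test". The relation names, the ranking of candidates, and the size figures are illustrative assumptions (the building's area in particular is a placeholder), and the single size rule stands in for the much larger body of constraints Cyc would apply.

```python
# Rough footprints in square meters; order-of-magnitude placeholders only.
AREA_M2 = {"Libya (country)": 1.8e12, "UN headquarters building": 1e5}

# Candidate readings of "Libya is in the U.N.", best guess first (hypothetical ranking).
CANDIDATES = [
    ("physicallyContainedIn", "Libya (country)", "UN headquarters building"),
    ("memberOf", "Libya (country)", "United Nations (organization)"),
]


def plausible(reading):
    """Reject physical containment of a larger object inside a smaller one."""
    relation, inner, outer = reading
    if relation == "physicallyContainedIn":
        return AREA_M2[inner] <= AREA_M2[outer]
    return True  # this sketch has no rule that excludes other relations


def first_plausible(readings):
    """Return the highest-ranked reading that is not ruled out, or None."""
    return next((r for r in readings if plausible(r)), None)


print(first_plausible(CANDIDATES))
```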


Wilshire: And how does Cyc integrate or interface with existing systems - to SQL for example, or to an Oracle database?

Lenat: Cyc can automatically generate SQL (in different dialects) to access and modify any SQL-capable external data source, if the source is adequately described in the Cyc knowledge base. We have not written our own database drivers in Cyc's implementation language (SubLisp), and so depend on externally developed drivers written in Java, C, or Perl. Cyc could be enabled to answer queries posed in SQL syntax, but much of CycL's expressiveness can't be duplicated in SQL. 

We will soon expand the SKSI framework to support querying and modification of external sources (including databases) via SOAP and XML-RPC. Also, we'll be trying some more experimental ways of having Cyc interact with databases. For DBMSs that support triggers (deductive databases), we might try to use Cyc as a trigger generator. The idea is to have Cyc determine what triggers (update mechanisms) would have to exist in the database to duplicate a chain of reasoning possible via Cyc inference rules, and then push that reasoning entirely into the DBMS by dynamically instantiating the required triggers.
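The trigger-generator idea is described only as an experiment, so the following is a speculative Python sketch of what "pushing one step of reasoning into the DBMS" could look like. It uses SQLite through Python's standard `sqlite3` module rather than any particular commercial DBMS, and the rule ("every mother is a parent"), the table layout, and the `trigger_for_rule` function are all invented for illustration.

```python
import sqlite3


def trigger_for_rule(premise, conclusion):
    """Generate a trigger asserting conclusion(x, y) whenever premise(x, y) is inserted."""
    return (
        f"CREATE TRIGGER {premise}_implies_{conclusion} "
        f"AFTER INSERT ON {premise} "
        f"BEGIN INSERT INTO {conclusion} VALUES (NEW.x, NEW.y); END"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mother (x TEXT, y TEXT)")
conn.execute("CREATE TABLE parent (x TEXT, y TEXT)")

# Install the generated trigger, then insert a fact that matches the rule's premise.
conn.execute(trigger_for_rule("mother", "parent"))
conn.execute("INSERT INTO mother VALUES ('Ann', 'Bob')")

derived = conn.execute("SELECT x, y FROM parent").fetchall()
print(derived)
```

Once the trigger exists, the database derives the conclusion on its own for every new row, without a round trip to an external inference engine.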

Wilshire: One of the features of Cyc is that users can play around with it and put their own rules into the engine. What sort of rules might an organization put into the engine? Are there any general principles of knowledge representation (KR) that the user needs to keep in mind? 

Lenat: Some of these principles have to do with economy of expression: having the totality of the content of the knowledge base (KB) be stateable as tersely as possible, introducing new terms that result in a net shrinkage of the KB footprint. Some have to do with what information is and isn't salient; e.g., if you talk about a new university, it would be strange to not have an assertion about where the school is located. There are so many such KR principles that a user should keep in mind that we've had to develop multiple interfaces, one using clarification dialogue in English for more or less novice users and with many more "dials" and "blinking cursors" designed only to be used by logicians and programmers who have had extensive training and experience in using Cyc. The good news is that we also have crystallized out a Cyc API, so a developer can simply make remote calls to ASSERT, ASK, JUSTIFY, etc. assertions in Cyc from their application programs.

Cyc's own integrity constraints prevent a user from entering syntactically malformed assertions, as we illustrated a minute ago when talking about Data Quality. So Cyc itself can dynamically help to catch many kinds of semantic (conceptual) errors at entry time. 

Cycorp offers a graduated sequence of training courses for prospective and current Cyc users, and also provides on-site consulting and training for Cycorp customers. These services provide the best opportunity for users to realize the full potential of Cyc. Unfortunately, doing good knowledge representation is at least as difficult as doing good data model design, and while Cyc includes tools and features that can help to guide the user, there still is no substitute for supervised training and experience.

Wilshire: So again, in terms of the interests of this readership...Have you had much discussion with the Business Rules community? I know you've met with Ron Ross, one of the thought-leaders in the field...is there any synergy with what the BR folks are doing?

Lenat: Yes, I've met Ron, and read his books, and BR is great stuff. Here's how I look at things. There are several "sweet spots" on the tradeoff curve between (1) simplicity of expression and efficiency of use versus (2) inferential power to do sophisticated reasoning, e.g. involving nested modals. Relational DBs are one of these; OO DB's are another; spreadsheets are another; Business Rules is another; and Cyc-based SKSI is another.

There could be synergy between our goals for Cyc and the general goals of the Business Rules community. It might be more accurate to say that Cyc can already provide much of what the Business Rules community is seeking, at least with regard to theory and implementation, but only much more slowly and awkwardly than BR. If that suffices, if that's all you need, it's better to use BR.

In his excellent tutorial pieces for the "Foundation Matters" column on the http://BRCommunity.com web site, Chris Date articulates for members of the Business Rules community many insights and perspectives that seem to come directly from "logicist" AI and knowledge representation. I'll cite just one of these pieces to help situate Cyc's potential contribution. In "The Question of Meaning" (August 2001), while discussing the importance of integrity constraints, Date writes:

In an ideal world ... the DBMS would know the meaning of every relation, so that it could deal correctly with all possible updates. But, of course, that's impossible. There's no way it can know those meanings exactly. For example, there's no way the DBMS can know what it means for a certain supplier to be "in" a certain city or to "have" a certain status; these concepts are outside the system -- they're understood by users, but not by the DBMS. More precisely, they're part of what logicians call the interpretation (of the relation in question). 

It's precisely knowledge of this sort -- what it means for a certain supplier to be "in" a certain city or to "have" a certain status -- that one finds in Cyc. Cyc's vast store of commonsense knowledge embraces domains such as buying and selling, commercial organizations, and product types, but also geography (cities, states, countries, continents), human interpersonal relationships (kinship, friendship, emotions), temporal and spatial relationships, and many other chunks of human consensus reality. Cyc's immense vocabulary includes scores of predicates for stating precisely defined senses of "in". When a database's data model has been described to (represented in) Cyc, any conventional integrity constraints pertaining to the database or its parts can certainly be expressed in Cyc's logical representation language. But since the database is described in terms of preexisting Cyc concepts and relations, any commonsense rules/constraints that apply to those concepts and relations will apply to the database, too. Cyc automatically adds an additional, semantically rich layer of "integrity constraints" for any external source represented in the Cyc knowledge base.

Wilshire: We hear a lot about the Semantic Web.  Obviously you have a lot of expertise in this area, so I'd like to ask you to look into your crystal ball. Can the Semantic Web actually happen the way it's being talked about? Or is it too ambitious? 

Lenat: Tony, I think the problem is that it's not ambitious enough. It is a step in the right direction, moving toward sharing explicitly the meaning of what is being said in the content or marked-up material. But it stops close to the level of agreeing on terms, rather than demanding axiomatizations that guarantee the agreement on (most of) the meaning of those terms. And it stops short of demanding an explicit account of the dozen types of metadata I mentioned before, without which there will of course be blatant contradictions and inconsistencies in data-space (e.g., the identity of ThePresidentOfTheUSA changes over time.)

I agree with and applaud the basic direction the W3C is taking in this. The disagreement is over how deeply to implement the "Semantic" part of the Semantic Web. I fear that if we don't go deeply enough, the same sort of superficial markup inconsistencies will be so rife that one will not be able to reason over the content - combine it arithmetically and logically - with any confidence at all. It will, though, enable a qualitatively better level of performance if your task is merely to retrieve relevant passages of information, which you as a human being then integrate and draw conclusions from.

Some Semantic Web supporters believe that it is a good starting point, from which the deeper sort of "Really Semantic Web" I'm describing will eventually evolve. While it might happen - a good shared, stable system of meanings (necessary for communication, not to mention for correct inference) might evolve over a long period of time by unioning the correspondences between myriad tiny, idiosyncratic ontologies - it seems more likely to me not to happen that way. Just like if you want to have a bridge that spans the Mississippi river, you shouldn't just facilitate conditions under which such a thing might come into being naturally, you should engineer and design and build the thing, based on engineering and physical principles. In the same way, I think the Really Semantic Web will be most likely to appear by being an extension of a very large, very broad reference ontology of terms plus a body of constraints and rules that hangs onto that ontological skeleton and to first order defines those terms. The Dublin Core is too limited and fragmented, and in some ways is no better than lists of keywords or "bags of words". The IEEE Standard Upper Ontology will quite possibly be irrelevant, since most communication that matters in the Semantic Web will be about concepts/entities represented in the middle and lower levels of any "vertically robust" reference ontology. Some inferential utility should derive from the upper level, but this level mostly needs to provide suitable attachment points for the middle-level concepts, and not be so bad as to cripple the middle and lower levels, or result in unsound inferences. The idea of the Semantic Web is not too ambitious. We want it to succeed (i.e., we want to see something like the Really Semantic Web come into existence), we just think that even all of Cyc itself will just barely be enough of a foundation on which to erect it.

Wilshire: So where does a resource like Cyc fit within the Semantic Web vision? Does it have a formal role?

Lenat: This may not be necessary. We've been active in the DAML initiative, and Cyc can now "speak" (export, and probably import) ontologies/assertions encoded in OWL. To the extent that the Semantic Web proves to be a success, and the Really Semantic Web comes into existence, it may well be because Cyc becomes the most common reference ontology of choice, whether or not this role is officially, formally sanctioned. 

Wilshire: Thanks Doug, I appreciate your time today, and we'll look forward to your talk in Los Angeles on May 4.

Feedback or questions? Write to Tony Shaw, Wilshire or Doug Lenat, Cycorp.


This edition of Data Discussions is sponsored by SearchDatabase.com & SearchOracle.com:
SearchOracle.com features free information resources for IT pros working with Oracle technologies.
Browse select white papers and tips from database-specific information resource SearchDatabase.com



Join us for the
Wilshire Meta-Data Conference
and DAMA International Symposium

May 2-6, 2004 • Century Plaza Hotel • Los Angeles, California USA

The World's Largest Vendor-Neutral Data Management Conference

The 16th annual DAMA International Symposium and 8th annual Wilshire Meta-Data Conference will be held May 2-6, 2004 at the Century Plaza Hotel in Los Angeles, a beautiful venue adjacent to Beverly Hills. Hear 40 case studies outlining strategies of companies that have implemented successful data management projects. There will be more than 120 speakers in all, covering meta data, enterprise architecture, data and process modeling, unstructured data, business rules, data integration, XML, business intelligence, data warehousing, information stewardship, and more. Keynote Speaker Chris Date. Click here for details.

Discounted hotel rooms available at the Park Hyatt if you reserve your room by April 16 -- details here.


This "Data Discussions" is a series of interviews with leading data management experts and practitioners, presented by Wilshire Conferences. Click here for links to more Data Discussions interviews.

For sponsorship information, contact Rick Froton at 603-305-0660.


©2004 Wilshire Conferences, Inc. May be quoted with full attribution.


FOOTNOTE:

How does Cyc integrate databases?

(as provided by Doug Lenat)

A complete CycL description of an external source comprises three layers, or levels of abstraction. The access layer includes the information required for Cyc to connect to the source, such as what it is (e.g., database, web site), the network address (host name, port number, URL), the communication protocol (SQL, SOAP), and the authorization tokens (user name, password). The physical schema includes information about the physical structure of the source, such as the names of its parts and subparts (table names, field names) and the low-level data types of the fields (char, string, integer, etc.). The access layer and the physical schema together constitute what many people still mean by the term "metadata". The third layer is the logical schema, which captures the high-level, humanly significant meanings that are usually only implicit in the external data source itself, for instance the level of granularity or sophistication at which this source is accurate (e.g., "high school physics level", or "to three decimal digits of accuracy"). This level of meaning is absent from the traditional understanding of metadata, and is now sometimes referred to in the literature as "semantic metadata". 
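For readers who think in data structures, the three layers can be pictured as nested records. The Python sketch below is only an analogy: the class and field names, the example host, and the example table are assumptions, and the real description is a set of CycL statements rather than objects like these.

```python
from dataclasses import dataclass


@dataclass
class AccessLayer:
    kind: str            # what the source is, e.g. "database" or "web site"
    address: str         # host name and port number, or a URL
    protocol: str        # e.g. "SQL" or "SOAP"
    user_name: str = ""  # authorization tokens
    password: str = ""


@dataclass
class PhysicalSchema:
    tables: dict         # table name -> {field name: low-level data type}


@dataclass
class LogicalSchema:
    meanings: dict         # "TABLE.field" -> term naming what the field denotes
    granularity: str = ""  # e.g. "to three decimal digits of accuracy"


@dataclass
class SourceDescription:
    access: AccessLayer
    physical: PhysicalSchema
    logical: LogicalSchema


description = SourceDescription(
    access=AccessLayer(kind="database", address="db.example.com:5432",
                       protocol="SQL", user_name="reader"),
    physical=PhysicalSchema(tables={"PERSON": {"last_name": "varchar", "age": "integer"}}),
    logical=LogicalSchema(meanings={"PERSON.last_name": "Surname"},
                          granularity="to the nearest year"),
)
print(description.logical.meanings["PERSON.last_name"])
```

The first two records correspond to traditional metadata; only the third says what the data means.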

Finally, the CycL representation of an external source includes "mapping" statements that allow Cyc to translate data values from the lower-level form represented by the physical schema (e.g., varchar) to the form expressed in the logical schema (e.g., the Cyc term Surname) and vice versa. Such mapping statements also make it possible for Cyc's inference engine to automatically translate CycL (logical) queries into the query protocol and syntax appropriate for each source, such as SQL or a web page form submission. Statements describing the cost (in terms of dollars, latency, privacy) and the relative completeness of a source allow the Cyc inference engine to use the sources efficiently, favoring those sources that are known to be inexpensive and complete.
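How cost and completeness statements might steer source selection can be sketched with a simple ranking function in Python. The source profiles, the numbers, the completeness threshold, and the ordering rule are all invented; the interview does not say how Cyc actually weighs these factors.

```python
# Hypothetical source profiles; every number here is invented.
SOURCES = [
    {"name": "paid_feed", "dollars": 0.50, "latency_s": 0.2, "completeness": 0.99},
    {"name": "local_db", "dollars": 0.00, "latency_s": 0.1, "completeness": 0.90},
    {"name": "slow_site", "dollars": 0.00, "latency_s": 5.0, "completeness": 0.60},
]


def rank_sources(sources, min_completeness=0.8):
    """Drop sources that are too incomplete, then order by price, latency, completeness."""
    usable = [s for s in sources if s["completeness"] >= min_completeness]
    return sorted(usable, key=lambda s: (s["dollars"], s["latency_s"], -s["completeness"]))


order = [s["name"] for s in rank_sources(SOURCES)]
print(order)
```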

Once a data source has been described to Cyc, the full power of Cyc's inference engine and enormous knowledge base can be brought to bear. The source is now integrated with all of the other data sources known to Cyc, not to mention the other millions of common sense things that Cyc already knows. 

Integration occurs at the semantic level, the level of the logical schema. This means that a source containing data about the locations and inventories of retail clothing outlets is automatically a candidate for joining with a source that provides the current ambient temperatures of cities (zip codes), even if, as seems likely, these two sources were developed for completely different purposes and have nothing in common at the physical level. Cyc knows that each of these sources has a field denoting, at the semantic level, a type of place (physical location). If the Cyc term denoting a place (e.g., CityOfAustinTX) can be translated to the physical-level form that each source requires, then a join is possible, and Cyc can construct the expression that performs it.
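
As a rough illustration of how join candidates might be discovered at the semantic level, the sketch below compares the semantic types of the logical fields in two sources. The field names and the simple type-equality test are assumptions made for this example; Cyc itself would reason over its knowledge base (for instance, about more general and more specific types of place) rather than match type names exactly.

```python
# Hypothetical logical schemas: logical field name -> semantic type.
RETAIL_LOGICAL = {
    "store": "RetailStore",
    "store_city": "USCity",
    "store_zip": "ZipCode",
}
WEATHER_LOGICAL = {
    "report_city": "USCity",
    "reading": "Temperature",
}


def join_candidates(left, right):
    """Return (left_field, right_field, semantic_type) for fields that share a type."""
    return sorted(
        (left_field, right_field, left_type)
        for left_field, left_type in left.items()
        for right_field, right_type in right.items()
        if left_type == right_type
    )


print(join_candidates(RETAIL_LOGICAL, WEATHER_LOGICAL))
```

The two schemas share no field names and no physical data types, yet the shared semantic type is enough to propose a join.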

Let's suppose that our first source is a relational database named ShopMart, which includes tables named LOCATION and INVENTORY:

LOCATION:

store_id address      city    state zip
-------- -------      ----    ----- ---

1123     346 MAPLE ST AUSTIN  TX    78756
4234     802 VINE AVE HOUSTON TX    77002
1498     239 MAIN     DALLAS  TX    75208

...

INVENTORY:

store_id product_code stock
-------- ------------ -----

1123     LE-3342          3
4234     LE-3342         24
1498     LE-3342         10

...

(We'll assume that the product code “LE-3342” denotes a particular model and size of overcoat, and that the values in the field named “stock” are updated in real time to reflect current inventory.)

The logical schema for the LOCATION table would include an explicit representation (description) of entities that are only implicit in the table’s physical schema.  The data in each row of the LOCATION table are really “about” a particular retail store and its contact location.  In CycL, the logical schema representation would include statements that look like this:

1. (isa (TheLogicalFieldValueFn LOCATION-LS RetailStore 1) RetailStore)
2. (isa
     (TheLogicalFieldValueFn LOCATION-LS PhysicalContactLocation 1)
     PhysicalContactLocation)
3. (isa (TheLogicalFieldValueFn LOCATION-LS USCity 1) USCity)
4. (isa (TheLogicalFieldValueFn LOCATION-LS IDNumber 1) IDNumber)
5. (isa
     (TheLogicalFieldValueFn LOCATION-LS StreetAddress 1) StreetAddress)
6. (isa
     (TheLogicalFieldValueFn LOCATION-LS ProperNameString 1) ProperNameString)
7. (isa (TheLogicalFieldValueFn LOCATION-LS ZipCode 1) ZipCode)
8. (pointOfContact
     (TheLogicalFieldValueFn LOCATION-LS RetailStore 1)
     PhysicalContactLocation
     (TheLogicalFieldValueFn LOCATION-LS PhysicalContactLocation 1))
9. (objectFoundInLocation
     (TheLogicalFieldValueFn LOCATION-LS RetailStore 1)
     (TheLogicalFieldValueFn LOCATION-LS PhysicalContactLocation 1))
10. (placeInCity
      (TheLogicalFieldValueFn LOCATION-LS PhysicalContactLocation 1)
      (TheLogicalFieldValueFn LOCATION-LS USCity 1))
11. (streetAddressText
      (TheLogicalFieldValueFn LOCATION-LS PhysicalContactLocation 1)
      (TheLogicalFieldValueFn LOCATION-LS StreetAddress 1))
12. (zipCodeForLocation
      (TheLogicalFieldValueFn LOCATION-LS PhysicalContactLocation 1)
      (TheLogicalFieldValueFn LOCATION-LS ZipCode 1))
13. (placeName-Standard
      (TheLogicalFieldValueFn LOCATION-LS USCity 1)
      (TheLogicalFieldValueFn LOCATION-LS ProperNameString 1))

....

At the level of the logical schema, the entries in the fields of the LOCATION table are instances of types denoted by CycL terms such as #$IDNumber, #$StreetAddress, #$ProperNameString, and #$ZipCode.  The field entries identify, or point to, still other entities, which are instances of types denoted by the CycL terms #$RetailStore, #$PhysicalContactLocation, and #$USCity.  These concepts, and the relations stated between them in the logical schema, are significantly more meaningful to humans, and more inferentially productive for Cyc, than the physical-level data types of the fields (integer, varchar).  Similarly, the logical schema for the INVENTORY table would include statements with the CycL terms #$RetailStore and #$WinterCoat, because the explicit entries in each row are intended to convey information about the current number of coats at a particular store.

Let's suppose that our second source is the National Oceanic and Atmospheric Administration's National Weather Service (NWS) web site, which includes a submission form that accepts a location identifier (zip code, or city name and state abbreviation) and returns weather information, including current ambient temperature, for the designated location.  The logical schema for the NWS web site includes these statements:

14. (isa (TheLogicalFieldValueFn NWS-LS USCity 1) USCity)
15. (isa (TheLogicalFieldValueFn NWS-LS Temperature 1) Temperature)
16. (ambientTemperature
      (TheLogicalFieldValueFn NWS-LS USCity 1)
      (TheLogicalFieldValueFn NWS-LS Temperature 1))

If a ShopMart manager wanted to implement a very precise just-in-time delivery scheme in which items of winter clothing are shipped to those stores where inventory is low and where the outside temperature is below freezing, Cyc could easily use the sources described above to answer a query like this:

 
Which ShopMart stores are located in a city where the current ambient temperature is below 32 degrees F and currently have fewer than 10 winter coats in stock?

Because of the information represented in the logical schemas, Cyc knows that it’s possible to form a semantic join between the LOCATION table and the NWS web site via the logical fields (TheLogicalFieldValueFn LOCATION-LS USCity 1) and (TheLogicalFieldValueFn NWS-LS USCity 1).  Similarly, Cyc knows that it’s possible to form a semantic join between the LOCATION table and the INVENTORY table via the logical fields that denote the retail store implied by each row (this is a trivial syntactic join as well).  Cyc’s knowledge of the access requirements, physical structure, and high-level semantics of these external sources allows it to build an SQL query expression for the ShopMart database, build a web query for the NWS web site, dispatch both queries, and combine the results to answer the query stated above.  Cyc’s ability to serve as the semantic glue between highly disparate external data sources means that whenever a new source is described to Cyc, that source is immediately “integrated” with all the other sources represented in the Cyc knowledge base.
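
The end-to-end flow described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not what Cyc generates: the ShopMart tables are loaded into an in-memory SQLite database, the NWS web form is replaced by a hypothetical lookup function, and the temperatures are invented for the example.

```python
import sqlite3


def build_shopmart_db():
    """Load the sample LOCATION and INVENTORY rows into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE LOCATION (store_id INTEGER, address TEXT, city TEXT,
                               state TEXT, zip TEXT);
        CREATE TABLE INVENTORY (store_id INTEGER, product_code TEXT,
                                stock INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO LOCATION VALUES (?, ?, ?, ?, ?)",
        [
            (1123, "346 MAPLE ST", "AUSTIN", "TX", "78756"),
            (4234, "802 VINE AVE", "HOUSTON", "TX", "77002"),
            (1498, "239 MAIN", "DALLAS", "TX", "75208"),
        ],
    )
    conn.executemany(
        "INSERT INTO INVENTORY VALUES (?, ?, ?)",
        [(1123, "LE-3342", 3), (4234, "LE-3342", 24), (1498, "LE-3342", 10)],
    )
    return conn


# Hypothetical stand-in for the NWS web form; these temperatures are invented.
FAKE_NWS_TEMPS_F = {
    ("AUSTIN", "TX"): 28.0,
    ("HOUSTON", "TX"): 45.0,
    ("DALLAS", "TX"): 30.0,
}


def lookup_temperature_f(city, state):
    """Return the (invented) current ambient temperature for a city."""
    return FAKE_NWS_TEMPS_F[(city, state)]


def stores_needing_coats(conn, product_code="LE-3342", max_stock=10, freezing_f=32.0):
    """Stores with stock below max_stock in cities colder than freezing_f."""
    # Syntactic join inside the database: LOCATION and INVENTORY on store_id.
    rows = conn.execute(
        """
        SELECT l.store_id, l.city, l.state
        FROM LOCATION AS l
        JOIN INVENTORY AS i ON i.store_id = l.store_id
        WHERE i.product_code = ? AND i.stock < ?
        ORDER BY l.store_id
        """,
        (product_code, max_stock),
    ).fetchall()
    # Semantic join across sources: the city links a store to a temperature.
    return [
        store_id
        for store_id, city, state in rows
        if lookup_temperature_f(city, state) < freezing_f
    ]


print(stores_needing_coats(build_shopmart_db()))
```

With the invented temperatures above, only store 1123 qualifies: Dallas is cold enough but holds exactly 10 coats, and Houston is both warm and well stocked. The join on store_id happens inside the database, while the join on city happens in the calling code, which is the part that the logical schemas make possible.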
