In Spain we have an old proverb: “God creates them and they come together”. The closest English proverb is “Birds of a feather flock together”. The English version carries a reference to race; the Spanish one does not. It is more generic and can be applied to any human characteristic.
The proverb means that individuals who share a certain characteristic will eventually end up together because they will have similar preferences. I like the Spanish version because it is not tied to DNA, which makes it more useful for analyzing classic problems of classifying people with computer algorithms.
We live in a world where the classification of individuals is more common than most people think, and computers usually follow that popular principle.
While in Western societies any classification by race would not be considered politically correct, many IT systems could be following that directive automatically.
One of the most common and simple classification algorithms is known as the nearest neighbor algorithm. To classify a new object, the computer searches for the nearest known object in the space of characteristics and automatically assigns the new object to the same group as that neighbor.
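The idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming Euclidean distance and two made-up numeric characteristics; the function name `nearest_neighbor`, the data and the group labels are hypothetical and do not come from any real system.

```python
import math

def nearest_neighbor(sample, labeled_points):
    """Return the group of the known point closest to `sample`."""
    # Each item is (characteristics, group); pick the smallest Euclidean distance.
    _, group = min(labeled_points, key=lambda item: math.dist(sample, item[0]))
    return group

# Hypothetical known objects: (age in years, hours online per day) -> group
known = [
    ((25.0, 6.0), "group A"),
    ((60.0, 1.0), "group B"),
]

result_young = nearest_neighbor((30.0, 5.0), known)
result_old = nearest_neighbor((55.0, 2.0), known)
print(result_young)  # -> group A
print(result_old)    # -> group B
```

The new object is never examined on its own merits: it simply inherits the group of whichever known object happens to be closest.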
The validity of this method depends on how many characteristics are used to define the space of characteristics, and on how they are measured in order to provide a mathematical distance.
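A small sketch can show how much the measurement matters. Assuming Euclidean distance and two invented characteristics (age and yearly income), the same person is assigned to a different group just because income is expressed in thousands of euros instead of euros. The names `nearest_label` and `to_thousands`, and all the numbers, are hypothetical.

```python
import math

def nearest_label(sample, labeled_points):
    """1-nearest neighbor with plain Euclidean distance."""
    _, group = min(labeled_points, key=lambda item: math.dist(sample, item[0]))
    return group

def to_thousands(point):
    """Same person, but income measured in thousands of euros."""
    age, income = point
    return (age, income / 1000.0)

# Hypothetical data: (age in years, yearly income in euros) -> group
known = [
    ((25.0, 30000.0), "group A"),
    ((60.0, 31000.0), "group B"),
]
sample = (58.0, 30100.0)

# In euros, the income axis dominates the distance.
in_euros = nearest_label(sample, known)

# In thousands of euros, age dominates instead.
rescaled = [(to_thousands(point), group) for point, group in known]
in_thousands = nearest_label(to_thousands(sample), rescaled)

print(in_euros)      # -> group A
print(in_thousands)  # -> group B
```

Nothing about the people changed between the two runs; only the unit of one characteristic did.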
Some years ago, I was working in a company as innovation manager. The human resources department hired a new colleague. She lived on the same street where I lived and had grown up. One day we met at the bus stop. As she was a work colleague, I said hello to her. She asked me why she had never seen me before. I answered that I had always studied in private schools far from that street and that, after university, I had moved to a distant city to work. It seems that the nearest neighbor algorithm does not fit this situation well.
Social networks collect information such as location about people every day, and then classify people by those properties in order to target advertising; however, I was living proof that this is not a good way to classify.
If DNA is not politically correct, and physical proximity is not good enough, how can we make a good classification?
First of all, the nearest neighbor algorithm cannot be taken in a literal sense. A good classification searches for neighbors in the space of characteristics, where physical location is at most one characteristic among many. The same sentence can mean something very different in a mathematical context than in a social or political context. A problem arises when you use mathematics to analyze social situations, which is very common with social networks. The solution can be very different depending on whether the project is driven by a mathematician or by a politician, because each uses the language differently.
On the other hand, we need to improve the algorithm. A more complex algorithm, known as k-nearest neighbors, can be used instead. It is similar, but now we look at the k nearest objects and assign the new object to the group that most of them belong to. Although it is better than the simple nearest neighbor, it is prone to the same errors.
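As a rough sketch, k-nearest neighbors can be written as a majority vote among the k closest known objects. The version below assumes Euclidean distance and an unweighted vote; the function name `k_nearest_neighbors`, the points and the labels are invented for illustration.

```python
import math
from collections import Counter

def k_nearest_neighbors(sample, labeled_points, k=3):
    """Return the most common group among the k known points closest to `sample`."""
    ranked = sorted(labeled_points, key=lambda item: math.dist(sample, item[0]))
    votes = Counter(group for _, group in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data with one isolated "group B" point sitting near group A.
known = [
    ((1.0, 1.0), "group A"),
    ((1.2, 0.8), "group A"),
    ((0.9, 1.3), "group A"),
    ((2.0, 2.0), "group B"),  # the isolated point
    ((5.0, 5.0), "group B"),
    ((5.2, 4.9), "group B"),
]
sample = (1.8, 1.8)

with_one = k_nearest_neighbors(sample, known, k=1)
with_three = k_nearest_neighbors(sample, known, k=3)
print(with_one)    # -> group B (decided by the single isolated point)
print(with_three)  # -> group A (two of the three nearest points are in group A)
```

The vote makes the result less sensitive to a single odd neighbor, but the distance is still computed in the same space of characteristics, so a badly chosen space misleads it in the same way.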
A good classification depends on how the space of characteristics is defined and how the information is gathered and distributed. This can be more important than the algorithm itself.
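The bus stop story can be turned into a toy example. In this sketch the algorithm stays exactly the same (1-nearest neighbor with Euclidean distance) and only the selection of characteristics changes. The three characteristics, the two groups and all the numbers are assumptions made up for the illustration.

```python
import math

def nearest_label(sample, labeled_points, selected):
    """1-nearest neighbor using only the characteristics at the `selected` indexes."""
    def project(point):
        return tuple(point[i] for i in selected)
    _, group = min(
        labeled_points,
        key=lambda item: math.dist(project(sample), project(item[0])),
    )
    return group

# Hypothetical characteristics, all in km:
# (home position along the street, distance from home to school, distance from home to work)
known = [
    ((0.0, 0.5, 2.0), "local crowd"),
    ((30.0, 25.0, 300.0), "commuters"),
]
me = (0.1, 20.0, 280.0)

by_location_only = nearest_label(me, known, selected=[0])
by_all = nearest_label(me, known, selected=[0, 1, 2])
print(by_location_only)  # -> local crowd
print(by_all)            # -> commuters
```

With location alone, the person is grouped with the neighbors on the street; once school and work are part of the space, the same algorithm groups them elsewhere.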
Automatic IT systems for classification are not only a matter of implementing algorithms but mainly a matter of system design. Artificial intelligence provides techniques to cope with more complex situations; however, it cannot be good enough if the system is not properly designed in terms of the selection of characteristics and the required information.
Computer scientists have spent several years analyzing classification problems with mathematical optimization algorithms and AI techniques such as neural networks. However, these techniques will not provide a good result if the system was never properly designed to select the characteristics required to solve the classification problem. A good system comes from good IT engineering, not only from good programming. System architecture is at least as important as computer algorithms.