Automatically recognizing data from Excel tables with varying structures, frequent mistakes, and odd remarks
This use case is part of "Automanager", a marketplace platform for wholesalers and retailers of auto tires and rims.
It was created to collect and refresh prices from market players.
The idea of the project is to receive unstructured information from the Excel pricelists of many wholesalers, standardize it, and prepare it for retailers.
More than 100 wholesalers regularly send their pricelists to a dedicated service email address. The Email Scraper extracts the attachments and puts them into a task queue for processing.
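To give an idea of the attachment-extraction step, here is a minimal sketch using only the Python standard library. The function name, the queue, and the simulated message are all assumptions for this example; the real Email Scraper is not shown in this article.

```python
# Hypothetical sketch of the attachment-extraction step, assuming the
# scraper has already fetched a raw RFC 822 message (names are illustrative).
import email
from email.message import EmailMessage
from queue import Queue

def extract_pricelist_attachments(raw_message: bytes, tasks: Queue) -> int:
    """Pull Excel attachments out of one message and enqueue them."""
    msg = email.message_from_bytes(raw_message)
    found = 0
    for part in msg.walk():
        filename = part.get_filename()
        if filename and filename.lower().endswith((".xls", ".xlsx")):
            tasks.put((filename, part.get_payload(decode=True)))
            found += 1
    return found

# Simulate one incoming email with a pricelist attached
msg = EmailMessage()
msg["Subject"] = "Weekly pricelist"
msg.set_content("See attached.")
msg.add_attachment(
    b"fake-xlsx-bytes",
    maintype="application",
    subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    filename="pricelist.xlsx",
)

tasks: Queue = Queue()
extract_pricelist_attachments(bytes(msg), tasks)  # the queue now holds one task
```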
Let's look at those pricelists.
Their structures are entirely different.
There are no standards for specifying product parameters or product titles.
There are many errors in the names.
Some of them contain a mix of Latin and Cyrillic letters that look identical.
Some use special symbols or signs instead of a stock amount.
Some even use the background color of a cell to indicate whether a product is in the warehouse or not.
Prices are indicated in different currencies.
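One of those problems, the mix of look-alike Latin and Cyrillic letters, can be tackled with a simple translation table. This is a minimal sketch; the mapping below is illustrative, not our production table.

```python
# Minimal sketch: map Cyrillic letters to the Latin letters they resemble.
# The table below is illustrative, not the production mapping.
CYR_TO_LAT = str.maketrans(
    "АВЕКМНОРСТХаеорсху",
    "ABEKMHOPCTXaeopcxy",
)

def normalize_name(name: str) -> str:
    """Replace Cyrillic homoglyphs so a brand name is spelled one way only."""
    return name.translate(CYR_TO_LAT)
```

Applied before any matching step, this makes names that merely look identical become byte-identical as well.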
Searching for a better way to recognize the data
Moving step by step, we started by recognizing the first and most difficult pricelists with Python scripts written separately for each new pricelist.
In this way, we gathered more and more experience and understanding of the situations. After the first 20 pricelists, we understood what kind of mess to expect further on.
The solution of connecting each Python script to its related pricelist profile may look neither perfect nor particularly secure, but it was quite enough for a startup with a limited budget.
How to handle it without coding
The next goal was to avoid writing code for each new pricelist. Understanding more and more similarities between the structures of different pricelists, we extracted the differences into configurable settings, so the client's staff can handle new pricelists themselves without special training.
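To give an idea of what such settings might look like, here is a hypothetical sketch of a pricelist profile: a small dict that maps a wholesaler's column headers to our canonical fields. The profile format and all names are assumptions for illustration.

```python
# Hypothetical pricelist profile: a per-wholesaler settings dict mapping
# their column headers to canonical fields (format and names are illustrative).
PROFILE = {
    "header_row": 0,
    "columns": {"Наименование": "title", "Цена": "price", "Остаток": "stock"},
}

def apply_profile(rows, profile):
    """Yield canonical records from raw spreadsheet rows using a profile."""
    header = rows[profile["header_row"]]
    # Map column positions to canonical field names
    index = {
        i: profile["columns"][name]
        for i, name in enumerate(header)
        if name in profile["columns"]
    }
    for row in rows[profile["header_row"] + 1:]:
        yield {field: row[i] for i, field in index.items()}

rows = [
    ["Наименование", "Цена", "Остаток"],
    ["205/55 R16 Michelin Primacy 4", "4500", "10+"],
]
records = list(apply_profile(rows, PROFILE))
```

With this approach, onboarding a new wholesaler means writing a new profile, not a new script.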
The recognition process also works through a Synonyms Vocabulary, which is continuously improving.
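At its simplest, a synonyms vocabulary can be a dictionary that resolves raw spellings, including Cyrillic ones, to a single canonical value. The entries below are illustrative only.

```python
# Illustrative sketch of a synonyms vocabulary: raw spellings resolve to one
# canonical value; unknown input passes through unchanged.
SYNONYMS = {
    "мишлен": "Michelin",
    "michelin": "Michelin",
    "бриджстоун": "Bridgestone",
    "bridgestone": "Bridgestone",
}

def canonical_brand(raw: str) -> str:
    """Look up the canonical brand for a raw spelling."""
    key = raw.strip().lower()
    return SYNONYMS.get(key, raw.strip())
```

Each manual correction can add a new entry, which is how such a vocabulary keeps improving over time.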
Manual work still remains
Even after the automatic script runs, some percentage of records cannot be recognized and requires manual work. To simplify this process, we show the operator a percentage of similarity.
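One possible way to compute such a similarity percentage is the standard library's difflib; the metric used in production may differ, so treat this as a sketch.

```python
# One way to compute the similarity percentage shown to operators,
# using the standard library (the production metric may differ).
from difflib import SequenceMatcher

def similarity_percent(a: str, b: str) -> int:
    """Return a 0-100 similarity score between two product names."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)
```

Records scoring above a chosen threshold can be pre-filled for the operator, leaving only a confirmation click.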
A possible better solution
Of course, we could improve this process considerably by using machine learning and other technologies. But that requires a large volume of data first, and a development budget as well.
How much and how long?
Maybe you are interested in the cost and duration of this solution. It was part of a general, continuous development and research process. In total, it took about 30 hours of research and development, spread out over time rather than done all at once. The whole improvement process took two months.
How does it help the business?
The whole process works autonomously, without our maintenance. The client uses it for their own retail store and sells the service to others.
Are you interested in a similar solution?
Let's find together where the bottleneck in your business is and solve it in the best way.