DataOps Archives - SD Times | Software Development News | https://sdtimes.com/tag/dataops/

DataOps engineers run toward error and automate it away | SD Times, August 20, 2021 | https://sdtimes.com/data/dataops-engineers-run-toward-error-and-automate-it-away/

The DataOps role is unique in the space of data analytics, with its goal to enable data engineers, scientists, analysts and governance teams to own the pipelines that run the assembly process. Essentially, DataOps engineers work on, but not in, these pipelines, according to a DataKitchen webinar titled “A Day in the Life of a DataOps Engineer.”

“We want to run our value pipeline, like Toyota makes changes. We also want to be able to change that pipeline, take a piece of it, change it, and be able to iterate quickly and change our pipelines as fast as Silicon Valley companies do on their websites,” said Christopher Bergh, the CEO and “head chef” at DataKitchen.

The space of DataOps combines Agile development, DevOps, and statistical process controls and applies them to data analytics. 

However, the current challenges in organizations stem from the fact that people don’t all have the mindset that their job is to deliver value to the end user, since they’re so focused on their immediate task at hand. 

“The challenge is that in a lot of ways, the DataOps role to a lot of people that do data engineering and data science isn’t there. It’s not apparent. So if they’re going to build something like a bunch of SQL or a new Jupyter Notebook, they kind of throw it to production and say I’ve done my work. My definition of done as a data engineer is, it worked for me. A lot of the time, the challenge for people who are doing data analytics is they focus on their little part and think the process of putting it into production is someone else’s problem,” Bergh said. “It’s very task-focused and not value-focused. Done should mean it’s in production.”

DataOps engineering is about collaboration through shared abstraction, whether that’s putting nuggets of code into pipelines, creating tests, running the factory, automating deployments, or working across different groups of people in the organization. It’s then about automating many of those tasks. “DataOps engineering is about trying to take these invisible processes, pull them forward and make them visible through a shared abstraction and then automate them,” Bergh said.

The challenge when it comes to automation, similar to many other scenarios, is that no one fully owns the process. This is where DataOps engineers come in. 

“While implementing a DataOps solution, we make sure that the pipeline has enough of a variety of automated tests to ensure data quality and to leave time for more innovation and reduce the stress as well as fear of failure,” said Charles Bloche, a data engineering director at DataKitchen.

In effect, every error leads to a new automated test that then improves the system. It is also the DataOps engineer’s role to test every step of the way in order to catch errors sooner, recover faster, and empower collaboration and re-use.

“For a data warehouse, the product is the dataset; for an analyst the product is the analysis, and for a DataOps engineer, the product is an effective, repeatable process,” Bloche said. “We are less focused on the next deadline, we’re focused on creating a process that works every time. A DataOps engineer runs toward error, because error is the key to the feedback loop that makes complex processes reliable. Errors are data.” 
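To make the “every error becomes a test” feedback loop concrete, here is a minimal sketch (not DataKitchen’s product, and with hypothetical table and column names) of data checks that run on every pipeline execution; the first check is the kind of regression test a team might add after a failed load, and the second is a simple statistical-process-control style bound:

```python
# Illustrative sketch only: turning a production data error into a reusable,
# automated check that runs on every pipeline execution. Column names and the
# revenue bounds are hypothetical.
import pandas as pd

def check_no_null_customer_ids(df: pd.DataFrame) -> list[str]:
    """Regression test added after a load once failed on null customer_id values."""
    errors = []
    if df["customer_id"].isnull().any():
        errors.append("customer_id contains nulls")
    return errors

def check_revenue_within_expected_range(df: pd.DataFrame) -> list[str]:
    """Statistical-process-control style check: flag totals far outside history."""
    errors = []
    total = df["revenue"].sum()
    if not (50_000 <= total <= 500_000):  # bounds assumed from past runs
        errors.append(f"daily revenue total {total} outside expected range")
    return errors

CHECKS = [check_no_null_customer_ids, check_revenue_within_expected_range]

def run_checks(df: pd.DataFrame) -> None:
    failures = [msg for check in CHECKS for msg in check(df)]
    if failures:
        # In a real pipeline this would alert the team and halt the deployment.
        raise ValueError("Data checks failed: " + "; ".join(failures))

if __name__ == "__main__":
    sample = pd.DataFrame({"customer_id": [1, 2, 3], "revenue": [40_000, 30_000, 10_000]})
    run_checks(sample)
    print("all checks passed")
```

In practice the bounds would be derived from run history rather than hardcoded, and a failure would notify the on-call engineer instead of simply raising an exception.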

SaaS backup: A more scalable way to ingest cloud app data | SD Times, January 26, 2021 | https://sdtimes.com/data/saas-backup-a-more-scalable-way-to-ingest-cloud-app-data/

It’s probably not surprising that, according to a 2018 Gartner survey about SaaS migration, 97% of respondents said their organization had already deployed at least one SaaS application. Today, a significant number of cloud applications have been elevated to the status of ‘business-critical system’ in just about every enterprise. These are systems that the business cannot effectively operate without: systems that are used either to inform important decisions or to take action directly.

It’s no wonder cloud applications like CRM, support, ERP or e-commerce tools have become prime targets for DataOps teams looking for answers about what is happening and why. After all, think about how much business data converges in a CRM system – particularly when it’s integrated with other business systems. It’s a mastered data goldmine!

DataOps teams often identify a high-value target application, like a CRM system, and then explore ways to capture and ingest data from the application via the system’s APIs. In the case of, say, Salesforce, they might explore the Change Data Capture and Bulk APIs. Various teams with different data consumption needs might then use these APIs to capture data for their particular use case, inevitably leading to exponential growth in data copies and compliance exposure. (After all, how do you enforce GDPR or WORM compliance for data replicas tucked away God knows where?!). 
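As a rough illustration of the ad hoc capture pattern described above, the sketch below pulls records over Salesforce’s REST query endpoint with plain HTTP; the instance URL, OAuth token and SOQL query are placeholders, and a production ingestion job would more likely use the Bulk or Change Data Capture APIs with paging, rate limiting and retries:

```python
# Minimal sketch of ad hoc SaaS data capture over a REST API. The instance URL,
# access token, and query are placeholders; production ingestion would typically
# use the Bulk or Change Data Capture APIs and handle paging, limits, and retries.
import requests

INSTANCE_URL = "https://example.my.salesforce.com"       # placeholder
ACCESS_TOKEN = "REPLACE_WITH_OAUTH_TOKEN"                 # placeholder
SOQL = "SELECT Id, Name, LastModifiedDate FROM Account"   # placeholder query

def fetch_accounts():
    resp = requests.get(
        f"{INSTANCE_URL}/services/data/v52.0/query",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        params={"q": SOQL},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("records", [])

if __name__ == "__main__":
    for record in fetch_accounts():
        print(record.get("Id"), record.get("Name"))
```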

When they encounter API limitations or even application performance issues, DataOps teams then start to replicate data into nearby data lakes. This enables them to create centralized consumption points for the SaaS data outside of the application. Here, storage costs are more favorable and access is ubiquitous. Here, teams typically take a deep breath and start a more organized process for requirements gathering, beginning with the question of “who needs what data and why?”

Meanwhile, in a parallel world, IT teams implement data backup strategies for those same cloud applications. If something bad happens (say, data corruption), these critical business systems need to be rapidly recovered and brought back online to keep the business going. Here, standard practice is to take snapshots of data at regular intervals, either through DIY scripts or with SaaS backup tools. In most scenarios, the backup data is put in cold storage because… well, that’s what you do with data replicas whose sole purpose is to act as an ‘insurance policy’ in case something goes wrong.
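The “DIY script” half of that practice often looks something like the following sketch, which serializes a day’s records and parks them in object storage under a dated prefix; the bucket name, storage class and data source are assumptions for illustration:

```python
# Sketch of the "DIY snapshot" backup pattern: serialize a dataset and park it
# in object storage under a dated prefix. The bucket name, storage class, and
# source of the records are assumptions for illustration.
import datetime
import json
import boto3

BUCKET = "example-saas-backups"  # placeholder bucket

def snapshot_records(records: list[dict], dataset_name: str) -> str:
    s3 = boto3.client("s3")
    today = datetime.date.today().isoformat()
    key = f"{dataset_name}/{today}/snapshot.json"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
        StorageClass="GLACIER",  # cold storage, per the "insurance policy" framing
    )
    return key

if __name__ == "__main__":
    demo = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
    print("wrote", snapshot_records(demo, "crm_accounts"))
```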

With all of these teams trying to consume the same data in the same organization, it is no surprise that costs and maintenance cycles quickly spiral out of control. ESG identified that for every TB of production data, another 9 TB of secondary data is typically generated – rapidly offsetting any savings from ever-decreasing public cloud storage prices.

So why are we inflicting this 9X+ data multiplier on ourselves?

One reason is convenience. It’s just easier to walk up, grab what we need and walk away. But convenience often comes at a cost to quality and security, and with added risk: how do you know the data you are grabbing is the best possible dataset the organization has on a particular entity? This question is particularly important in organizations that have strong data mastering initiatives. If your replicas contain sensitive data that you are tucking away in some generally unknown place, are you expanding the attack surface area for the organization? Are there governance or compliance regulations that your data may fall under?

Another reason is that “we’ve always done it this way.” The status quo of thinking about backup data as an insurance policy, separate and unrelated to SaaS data ingestion for other scenarios, reaches back to before the days of SaaS applications themselves – when data backup and ingestion were two separate motions done at the database level.

How we do things is just as important as doing them in the first place. And changing HOW we do things is hard. It starts with the realization that the status quo no longer applies – in this case, the realization that cloud applications allow for fundamentally different data consumption patterns, and that backup tools can be the perfect hat trick to take back ownership and control of your cloud application data and to re-use backed-up data for all other data consumption needs across your organization.

SD Times news digest: DataKitchen’s DataOps Transformation Advisory Service, Netlify team overview, and Rackspace’s new IoT solutions | SD Times, October 26, 2020 | https://sdtimes.com/softwaredev/sd-times-news-digest-datakitchens-dataops-transformation-advisory-service-netlify-team-overbiew-and-rackspaces-new-iot-solutions/

The new DataOps advisory service by DataKitchen aims to help customers achieve an enterprise DataOps transformation by leveraging industry-leading DataOps expertise and the company’s critical capabilities necessary to launch a successful and sustainable DataOps initiative.

Customers can now choose from a menu of services such as Strategic DataOps, Technical DataOps, Maturity Model Assessment, DataOps Dojo and more. 

“Many companies know that DataOps provides the foundation for analytic excellence, but struggle when it comes to designing and executing a DataOps plan. Our software is an important piece of the puzzle because it automates all the critical elements of a DataOps program – orchestration, testing, environment creation and management, and deployment,” said Chris Bergh, the founder and CEO of DataKitchen.

Netlify team overview simplifies collaboration
Netlify announced the release of Team Overview, a central dashboard in the Netlify UI that surfaces the most important information about teams and the projects they’re working on. 

Users can see everything from the real-time status of a team’s site builds to the latest audit logs in one place, so that everyone has a common understanding of what’s happening across a team’s websites and apps.

At a glance, users can see team usage, build status, audit logs, team members, sites, team logo and more. 

Rackspace announces new IoT solutions

Rackspace’s enhanced IoT capabilities will enable enterprises to develop competitive products, increase operational efficiency and create new revenue streams.

The new enhancements include IoT Accelerator, a five-day assessment led by a team of IoT experts who evaluate IoT use cases and provide recommendations for pilot solutions; production solutions to design and develop tailored solutions for IoT users; and more.

“Our enhanced IoT services and capabilities enable enterprises to reduce the time and cost associated with piloting new initiatives, free up resources typically allocated to navigating complexities, and ultimately, speed up time to market,”  said Tolga Tarhan, the CTO of Rackspace Technology.

Linux 5.10-rc1 released
The release includes driver updates and changes. 

This includes Christoph’s set_fs() removal, which Linux creator Linus Torvalds found interesting because the whole model of using set_fs() to specify whether a userspace copy actually goes to user space or kernel space goes back to pretty much the original release of Linux.

Additionally, x86, powerpc, s390 and RISC-V have had the address space overrides removed, and all the core work is done. Additional details on the new release are available here. 

Apache weekly update
Last week in Apache Software Foundation news includes the release of Apache XMLBeans 4.0 for accessing XML by binding it to Java types.

Other releases this week include Apache Jackrabbit 2.21.4, Kylin 3.1.1, Arrow 2.0, and more.

Apache Camel 3.6.0 was also released with speed optimizations and functionality to avoid throwing exceptions, and Camel 3 has been changed from singleton to prototype scoped.

Additional details on all of Apache’s releases are available here.

Software predictions for 2020 from around the industry | SD Times, December 17, 2019 | https://sdtimes.com/softwaredev/software-predictions-for-2020-from-around-the-industry/

Thought leaders weigh in on what we can expect from the software development industry in 2020:

Adam Scroggin, CEO of CardBoard
DevOps will continue to be key as we move toward 2020. Software teams will notice more and more that once a product is released, it is not done. Software products are never done. We have begun to see more applications moving to mobile and web, which allows software teams to instrument their product to learn if customers are using what they released and how much value they are getting from it. Not all ideas are good ones, but getting out there and testing them before scaling will be vital for the next decade. Good DevOps practices have paved the road for ideas to move into production quickly.  

Monte Zweben, CEO of Splice Machine
“Cloud Disillusionment” blossoms because the meter is always running. Companies that rushed to the cloud finish their first phase of projects and realize that they have the same applications they had running before that do not take advantage of new data sources to make them supercharged with AI. In fact, their operating expenses actually have increased because the savings in human operators were completely overwhelmed by the cost of the cloud compute resources for applications that are always on. Ouch. These resources were capitalized before on-premise but now hit the P&L.

RELATED CONTENT: Gartner’s top 10 technology trends for 2020

Antony Edwards, COO of Eggplant
Technology is going to become increasingly regulated across the globe. Testing will not escape this, and by 2025 AI algos will need government certification. Testing will need to be able to guarantee that the system is safe to release, delivers the desired experience and that it’s ethically sound. In the 2020s, testers will become software optimizers. They will focus on utilizing intelligent technology to help digital businesses continually improve.

Scott Johnston, CEO of Docker
Containers pave the way to new application trends — Now that containers are typically considered a common deployment mechanism, the conversation will evolve from the packaging of individual containers to the packaging of the entire application (which are becoming increasingly diverse and distributed). Organizations will increasingly look for guidance and solutions that help them unify how they build and manage their entire application portfolio no matter the environment (on premise, hybrid/multi-cloud, edge, etc.)

Tatianna Flores, head of Atos North America’s AI Lab
In 2020, AI product companies will incorporate elements of reinforcement learning and wide-scale data sharing to remain competitive. 2019 revealed that highly specialized applications of AI geared toward industry-specific problems are hot commodities. Tesla acquired a company that focuses exclusively on object recognition, and McDonalds acquired a speech recognition company focused on languages. In the coming year, we’ll see even greater competition to improve performance in these popular and specialized applications of AI. Products will need to integrate reinforcement learning to constantly improve deep learning applications and stay ahead of their competition. Also, movement toward wide-scale data sharing will occur more rapidly.

John Pocknell, senior solutions product manager for Quest Software’s information management business unit
NoSQL will gain momentum. NoSQL hasn’t seen a huge amount of movement in recent years, but I believe we’ll see it pick up more next year, especially as people move towards fresher and newer data needs. While relational databases are good for traditional workloads like OLTP applications and business analytics (OLAP), for more complex OLTP workloads that include low-latency applications, NoSQL is better (versatility, agility, scalability). Ultimately, it’s a matter of getting the right database to suit the workloads of the organization, especially with the variety of structured and unstructured data in use.

Tim Tully, CTO of Splunk
2020 will be the year of the indulgent user experience, and that doesn’t bode well for the holdouts. Even as enterprise and industrial applications evolve, they’re not yet consumer-friendly enough for daily users. Enterprise software companies who are still producing dull user experiences will find it harder to keep their users loyal, and will be even more vulnerable to disruption. When it comes to enterprise UX, the companies that will succeed are the visionaries that design software to make people’s entire experience better.

Srinath Perera, vice president of research at WSO2
Cloud APIs will democratize AI. To date, custom AI model building has been limited to large organizations with the resources to tackle the complexity of AI deployment and management, not to mention the scarcity of experts and data. But now, cloud APIs make it possible for a few organizations to concentrate on providing the expertise and data required to solve a given problem, and then share or market the AI models they build. In this way, cloud APIs hold the promise to solve many AI use cases in 2020 by letting organizations of all sizes gain access to AI models provided by data experts.

Prince Kohli, CTO of Automation Anywhere
RPA will play a pivotal role in global data privacy and governance initiatives. The 2020s are shaping up to be the decade defined by big data – with the advent of 5G and the explosion of connected devices. In this new era, we’ll see even more pressure on companies to be fully transparent about the information they collect and how it’s used, with legislation like GDPR and the upcoming California Consumer Privacy Act (CCPA) representing only the tip of the data governance iceberg. Additionally, as malware increasingly becomes enhanced with artificial intelligence (AI) to identify network vulnerabilities, intelligent, secure bots will be a critical line of defense against data breaches. 

Matthew Halliday, co-founder and VP of product for Incorta
Quantum computing applications will take off in 2020: Quantum computing remains in the most nascent stages of development, but the possibilities are fascinating – quantum computing unlocks a new world of use cases that were previously impossible. While we may still be years away from widespread use cases, the number of initial applications will skyrocket in 2020, as companies like Google and IBM join smaller outfits like Quantum Thought in beginning to commercialize their quantum abilities. As a result, 2020 will bring heavy investments in quantum computing applications from venture capitalists and major enterprises alike – the upside is simply too great to ignore.

Avon Puri, CIO of Rubrik
Data privacy takes the next step. It used to be that organizations had to spend millions of dollars on consultants to find out where PII (sensitive) data lived, but today there are a number of data privacy and governance technologies that can bolster … and data practices. Next year will see an inflection point in organizations finally understanding more about their data – which will be critical to improving data privacy standards as an industry.

Jans Aasman, CEO of Franz, Inc. 
Digital immortality will emerge: We will see digital immortality emerge in 2020 in the form of AI digital personas for public figures. The combination of Artificial Intelligence and Semantic Knowledge Graphs will be used to transform the works of scientists, technologists, politicians and scholars into an interactive response system that uses the person’s actual voice to answer questions. AI digital personas will dynamically link information from various sources – such as books, research papers and media interviews – and turn the disparate information into a knowledge system that people can interact with digitally. These AI digital personas could also be used while the person is still alive to broaden the accessibility of their expertise.

Kirit Basu, VP of products for StreamSets
DataOps will gain recognition in 2020: As organizations begin to scale in 2020 and beyond — and as their analytic ambitions grow — DataOps will be recognized as a concrete practice for overcoming the speed, fragmentation and pace of change associated with analyzing modern data. Already, the number of searches on Gartner for “DataOps” has tripled in 2019. In addition, StreamSets has recognized a critical mass of its users embracing DataOps practices. Vendors are entering the space with DataOps offerings, and a number of vendors are acquiring smaller companies to build out a discipline around data management. Finally, we’re seeing a number of DataOps job postings starting to pop up. All point to an emerging understanding of “DataOps” and recognition of its nomenclature, leading to the practice becoming something that data-driven organizations refer to by name.

Michael Morris, CEO of Topcoder
So what’s the future of work? It’s the passion economy. Forget the set-schedule work week – the future of work will be driven by the “passion economy,” especially in the tech world. As the prevalence of open workforce models grows, freelance designers, developers and data scientists will shift loyalties to the work that’s out there, rather than a specific company. In order to recruit and retain people with coveted tech skills, companies will need to provide interesting projects for the freelance community that challenge and inspire them.

Chris Patterson, senior director of product management at Navisite
Big data democratization will make everyone data analysts. Big data has been a buzzword for so long, it has lost value. But, in 2020 and beyond, we’ll see it begin to provide real, tangible results. One reason for this is that data warehousing tools have improved and are no longer inhibitors to accessing enterprise insights in real time. Going forward, employees and stakeholders – from IT to the Board of Directors – will be able to more easily tap into the data well and become analysts themselves. And, with the democratization of data, the focus will shift from how to access data to: 1) asking the right questions of data, and 2) identifying who within your company is best positioned to analyze and glean answers from that data.

 Maty Siman, founder and CTO at Checkmarx
Open source vulnerability. With organizations increasingly leveraging open-source software in their applications, next year, we’ll see an uptick in cybercriminals infiltrating open-source projects. Expect to see attackers “contributing” to open-source communities more frequently by injecting malicious payloads directly into open source packages, with the goal of developers and organizations leveraging this tainted code in their applications. 

Steve Burton, DevOps evangelist for Harness
DevOps Teams will continue to replace Jenkins. There will be a new breed of CI/CD solution where engineers won’t write a single script, update a single plug-in, restart a single slave, work late nights or weekends debugging their failed deployments. Instead, engineers will adopt Continuous Delivery as-a-Service where deployment pipelines auto-verify and rollback code, thus allowing engineers to get their lives back after 6 and spend weekends with their family and kids.

George Gallegos, CEO of Jitterbit
2020 will be a test for the integration market. The integration market is one of the hottest markets today, and we don’t expect demand to slow. But integration comes in many flavors, and while traditional integration offerings may work well for a small subset of businesses, the biggest impact and growth will occur in enterprises undergoing digital transformation and relying heavily on comprehensive connectivity strategies. The past year was marked by several acquisitions and partnerships as integration and API vendors scrambled to expand capabilities to support enterprise-class needs. 2020 will be a test to see which bets worked, and I suspect only a handful of vendors are well equipped to address all of the aspects of enterprise-class iPaaS; the divide between those who are and those who are not will become even more stark.

Oskar Sevel Konstantyner, product owner and team lead at Templafy
In 2020 we’ll see enterprises ensuring that their choice of cloud doesn’t limit their agility and performance. While AWS, Azure and Google Cloud look very alike, they do have specific distinguishing features. Enterprises are moving toward multi-cloud computing to not limit themselves to the features of a single cloud. Initiatives like Azure Arc, where it’s possible to deploy Azure technology on Amazon servers, clearly show how cloud vendors support this journey. 2020 will be less about retaining customers by locking them to a single cloud vendor and more about convincing them to stay by being the best in some areas – and admitting that other vendors might offer better services in other areas.

David Cramer, co-founder and CEO of Sentry
Tool and framework frenzy will continue; fatigue will worsen: The plethora of tools, languages, and frameworks are adding massive complexity to the application development ecosystem. IT teams are challenged to interconnect these disparate languages and platforms to build applications that are the lifeblood of business in today’s digital economy. And while conference halls echo with cries of tool and framework fatigue, there will not be a clear resolution in 2020. In fact, there will likely be more disruption. Although it seems React.js is approaching victory for frontend development, there are still a number of viable competitors ready to shake things up. On the backend, there is still no standardization, in spite of significant innovation in recent years. PHP, Ruby, Python, Node.js, Java, and .Net are all in use—but there is no clear winner and that won’t change in 2020. As teams struggle to connect it all, even more tools—many of which will be open source—will emerge to integrate technologies, but the challenges of complexity and control will get worse before they get better.

Adam Famularo, CEO of ERwin
Data finds a soul. Highly regulated industries will begin to change their philosophies, embracing data ethics as part of their overall business strategy and not just a matter of regulatory compliance. In addition, ethical artificial intelligence (AI) and machine learning (ML) applications will be used by organizations to ensure their training data sets are well-defined, consistent and of high quality.

Alan Jacobson, chief data and analytics officer, Alteryx
The CDO role is evolving. The CAO is the new breed: The role of the data chief is changing, as is their title. The chief data officer needs to progress, and in 2020 the chief analytics officer title will really rocket upwards. It’s a manifestation that at last the role, and the projects managed within business, are less about data and more about what businesses are doing with it. The CAO is now a type of digital transformation officer – and in fact could just be termed a transformation officer – a sign that those in the role are becoming more tightly focused on what business success is really about.

Vanessa Pegueros, chief trust and security officer at OneLogin
With the convenience that the iPhone’s facial recognition has brought to the masses, end users will continue to expect similar offerings from most, if not all, applications in 2020. Although facial recognition has its flaws, the convenience outweighs the concerns for users.

Robert Reeves, co-founder and CTO of Datical, a database release automation provider and the creator of the open source tool Liquibase
The adoption rate of new technology will dramatically increase, especially with open source. Just look at Kubernetes — we were all amazed at how quickly that proliferated. The same thing is going to happen with technologies like Spinnaker, but even faster. JPMorgan Chase made a public declaration of their commitment to Spinnaker at SpringOne, and we’re going to see more companies do the same. Based on this, CIOs need to actively explore these new technologies and pay attention to what their developers are interested in, as this will indicate the areas they need to invest in.

Delphix 6.0 released with DataOps Marketplace | SD Times, November 25, 2019 | https://sdtimes.com/data/delphix-6-0-released-with-dataops-marketplace/

In an effort to speed up enterprise application development, Delphix has announced the latest release of the Delphix Dynamic Data Platform (DDDP). Version 6.0 is designed to eliminate test data wait times and accelerate application release cadence.

The company also announced the release of its DataOps Marketplace, which aims to help data teams integrate all data sources with DevOps tools, cloud platforms and other parts of the SDLC.

RELATED CONTENT:
DataOps is more than just DevOps for data
Is DataOps the next big thing?

“As companies move along their DevOps and Cloud journeys, they are waking up to the realization that access to data can cause delays that slow their DevOps teams down; holding them back from achieving the speed necessary to compete in today’s digital business landscape,” said Jim Mercer, research director for IDC. “Effective DevOps teams need self-service access to data that can be driven by APIs and supports diverse data sources that may be located on-premises or in the cloud.”

The Delphix 6.0 Platform includes support for Google Cloud, a virtualization SDK that allows users to develop plugins for any data source, and extensible masking to create tailored masking solutions.

“Enterprises look to Google Cloud to accelerate their digital transformation journeys and support critical business initiatives,” said Robert Harper, director of channel sales, partnerships & alliances at Google Cloud. “This Delphix integration is an important development for enterprise customers to accelerate DevOps and application workload delivery to Google Cloud Platform.”

The DataOps Marketplace includes a data sources showcase, a DevOps and automation showcase, and a cloud deployment showcase.

“The Delphix DataOps marketplace is the logical next step for the company. It will allow customers and partners to create, share, access and even monetise the work they have undertaken in making Delphix work across the many systems and data sources that complex businesses work with today and tomorrow. This marketplace will accelerate the opportunities for Delphix and their customers to integrate all their data sources into their DevOps workflows,” added Tim Sheedy, principal advisor at EcoSystm, a technology research and advisory firm.

DataOps is more than just DevOps for data | SD Times, November 13, 2019 | https://sdtimes.com/data/dataops-is-more-than-just-devops-for-data/

Development, testing, security and operations have all been transformed to keep up with the pace of software today — but one piece is still missing. Data is now becoming a roadblock to Agile and DevOps initiatives.

“People are getting stuck with data saying ‘I have my infrastructure layer automated and self-serviced so a developer can push a button and an environment can be configured automatically. I have made my entire CI/CD pipeline, my entire software delivery life cycle automated. I can promote code. I can test code. I can automate testing. But the last layer is data. I need data everywhere,’” said Sanjeev Sharma, vice president and global practice director for data modernization and strategy at Delphix.

RELATED CONTENT: 
Is DataOps the next big thing? 
Multi-model databases are the answer to breaking down data silos

As a result, development teams are starting to turn to DataOps to help speed up that data layer. SD Times recently caught up with Sharma who spoke about what DataOps means, how to be successful, and what’s next for data. 

SD Times: I’ve heard people refer to DataOps as just another term for DevOps, so how would you define DataOps?
Sharma: If you look at the history of the word DataOps, it started off mainly from the data science people — people wanting to do artificial intelligence and machine learning who had lost data assertion. 

I was talking to a client of ours who was saying most data scientists don’t come from a computer science background, so their method of versioning data is “save as” and put a number at the end of the file name. It is that primitive. Of course he was exaggerating, but what he was saying is that there is no way to manage data. 

Our perspective of DataOps is very simple. In your enterprise, you have data owners: people who create the data, either because they own the application and customers using it create data, or because the data is coming from logs [such as] telemetry data from a mobile application or log data from something running in production. And then there are data managers. These are the database administrators and security people whose job is to manage the data, store it and secure it. Then there are the data consumers. These are your data scientists, your AI and ML experts, your developers and your testers who need the data to be able to do their job. How do you make these three sets of stakeholders work together and collaborate in a lean and efficient manner? That is DataOps.

It involves process improvement, and it involves technology.

So do you follow or recommend people look at the DataOps Manifesto?
The DataKitchen team wrote the manifesto, which is a data science company, so they have a data science-centric view of data, but the manifesto is a great thing. It sets up some of these things that I am talking about out in the open to say it is not just technology. It is not just building a data pipeline. If you don’t change the organizational ownerships and bring out the responsibility between the data consumers, data owners and data managers, you are not going to succeed. That explains it very well. I think it is a great opening move. I wouldn’t say it is the final word though.

What makes a successful DataOps initiative?
DataOps has two perspectives. If you are looking at it from a data science lens, you are looking at how you got your data science activities to a stage where the biggest source of friction is the inability to get the right data at the right time to the right people.

From a DevOps lens, you are asking yourself if you have reached a stage where you are struggling with getting the right data to the right people at the right time… and you might not experience that unless you are Agile. If you still have a six month waterfall life cycle, six months is enough time to make a copy of a database. But if you are doing daily builds, true CI/CD and doing daily deployments to test environments — you need that data to be refreshed daily, sometimes multiple times a day. You need developers to be able to do local data sets for themselves, and be able to branch data to do A/B testing. You are more likely to hit that friction point when you have already done some level of automation around environments and code. Data won’t be what you address first. 

What are the benefits database owners and database admins get from DataOps?
Data managers are hired and paid to manage data, store it, make it available to the people who need it, and secure it to make sure they don’t get hacked. They are there to manage data in a lean and efficient manner. Making copies of data for data consumers is not their job. It is something a developer opens a ticket and tells a DBA to do. That ticket is the last one on the list because the database admin has other tickets that say this database needs to be fine-tuned because it is not performing properly; this database index needs to be reindexed; I need to add a new database for this new production environment; or I am running out of storage. All of those will have a higher priority than a developer asking for a copy.

Why not automate that and provide self-service to the data consumer? It makes their job more efficient because they can focus on the high-priority tasks like managing data schemas or making the database, rather than low-level copies.

From a data owner perspective, if the data is not being used, what use is it? It is just being stored. It is just sitting there. They have data for 20 years, but the data consumer only has access to the last three years. To a business owner, they are looking at what information, what insights and inferences they are not able to access because of a policy that says I can’t give that to anyone. For them, they want data they can use as an asset which can be mined, used to draw inferences to better understand their customers, make better predictions, and make better investment decisions. Getting business value out of data is what DataOps brings to them. 

How can you keep  DataOps initiatives on track? 
DataOps by itself has no value in the sense of it has to be in the context of either you are doing a data science initiative and you need the data to be Agile for that initiative, or you are doing DevOps and you need data to be Agile for DevOps. A DataOps initiative has to be attached to a DevOps or data science initiative because it is serving that purpose of making data lean, Agile and available to the right people.

That train needs to be moving and DataOps is just making the track straighter and faster. 

How do data regulations and data privacy concerns come into play in a DataOps movement? 
One of the tenets of DevOps is to make production-like environments available, which means the data should be production data. It shouldn’t be synthetic data. Synthetic data doesn’t have the noise and the texture of production data. You will need synthetic data if you are building a new feature whose data doesn’t exist in production yet, but everywhere else you want to put production data in your lower environments — and that raises security and compliance restrictions.

We at Delphix do masking of the data. We do it at two layers. We mask the data, so we will replace all the sensitive information with dummy information while maintaining the relational integrity. 

The second thing we do is put in a lot of identity and access management controls. For instance, we can put in policies that say if the data is not masked and classified at this level, you cannot provision it to an overseas environment.
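Delphix’s own implementation isn’t shown here, but the general idea of masking while preserving relational integrity can be illustrated with a deterministic pseudonymization sketch: the same sensitive value always maps to the same replacement, so joins across tables still line up, and a toy policy check gates where a dataset may be provisioned:

```python
# Illustrative only (not Delphix's implementation): deterministic masking so the
# same sensitive value always maps to the same replacement, preserving relational
# integrity across tables, plus a toy provisioning policy check.
import hashlib

def mask_value(value: str, salt: str = "static-project-salt") -> str:
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return f"user_{digest[:10]}"

def mask_column(rows: list[dict], column: str) -> list[dict]:
    return [{**row, column: mask_value(str(row[column]))} for row in rows]

def can_provision(dataset_masked: bool, classification: str, target_region: str) -> bool:
    """Toy policy: unmasked or restricted data may not leave the home region."""
    if not dataset_masked or classification == "restricted":
        return target_region == "home"
    return True

if __name__ == "__main__":
    customers = [{"customer_id": 1, "email": "ada@example.com"}]
    orders = [{"order_id": 10, "email": "ada@example.com"}]
    masked_customers = mask_column(customers, "email")
    masked_orders = mask_column(orders, "email")
    # Join keys still line up after masking:
    assert masked_customers[0]["email"] == masked_orders[0]["email"]
    print(masked_customers, masked_orders)
    print("overseas provisioning allowed:", can_provision(True, "internal", "overseas"))
```

Real masking tools typically generate realistic, format-preserving dummy values rather than hashes; the hash here only demonstrates that referential integrity can survive masking.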

What is the state of DataOps today?
It is where DevOps was maybe 8 years ago where we were spending time explaining to people what was DevOps. Today, we don’t do that. We don’t need to explain to anyone what DevOps is. It is very well established, even though there are multiple definitions floating around, they are all at least on the same playing field.

With DataOps, I think we are still at that “what is DataOps and does it apply to me” stage. I would say there are still a couple of years before you have a DataOps Day or a conference dedicated to it.

What is still to come from the DataOps movement? 
Most of the world’s data is still living on a mainframe, so that spectrum needs to be addressed. Our goal is to say no matter what kind of data, or where it is, we will allow you to manage it like code.

SD Times Open-Source Project of the Week: Titan | SD Times, November 8, 2019 | https://sdtimes.com/os/sd-times-open-source-project-of-the-week-titan/

Data is becoming more important than ever, and developers are beginning to realize they need better ways to harness and work with data. The problem, however, is that data isn’t handled the same way development is and therefore it can become a time-consuming and complex process. 

“The rise of git, docker, and DevOps has created a new world where developers can easily build, test, and deploy right from their laptop. Despite these advances, developers still struggle to manage structured data with the same speed and simplicity. Techniques like SQL scripts, database dumps, and plain text exports still leave a lot of work for developers,” the Delphix Titan team wrote on a website

To address this, Delphix open sourced Titan earlier this year. Titan is an open-source project that enables developers to treat data like code.

“The thinking behind Titan is today the way developers develop is locally on their laptop. They pull code from their git repository, they clone that code locally on their laptop, and they go to work. What do they do for data? They are actually copying databases around and they can’t copy a commercial database around. Even if they get that data, they can’t version it. If they do testing that changes the data, then they have to get another copy and it is all a manual process. There is no git for data, and there have been several attempts to make it, so we decided we would make our own,” Sanjeev Sharma, vice president and global practice director for data modernization and strategy at Delphix, told SD Times.

Titan is not git for data, but it provides capabilities that help developers manage, version and branch databases locally on their laptops, Sharma explained. The project enables developers to clone, commit, checkout, push and pull data like code. In addition, they can rollback to a previous state, build a test data library and share structured datasets, according to the project’s website. Other features include data versioning, support for off-the-shelf Docker containers, and a command line tool.
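Titan’s actual CLI and API aren’t reproduced here, but the underlying idea of “versioning data like code” can be sketched as content-addressed, immutable snapshots that can be committed, checked out and rolled back, purely for illustration:

```python
# Conceptual sketch of "versioning data like code" -- NOT Titan's API or CLI,
# just the underlying idea: store immutable, content-addressed snapshots of a
# dataset so any previous state can be checked out again.
import hashlib
import json

class DataRepo:
    def __init__(self) -> None:
        self.objects: dict[str, str] = {}   # snapshot id -> serialized data
        self.history: list[str] = []        # ordered commit ids

    def commit(self, records: list[dict]) -> str:
        blob = json.dumps(records, sort_keys=True)
        snapshot_id = hashlib.sha1(blob.encode("utf-8")).hexdigest()[:12]
        self.objects[snapshot_id] = blob
        self.history.append(snapshot_id)
        return snapshot_id

    def checkout(self, snapshot_id: str) -> list[dict]:
        return json.loads(self.objects[snapshot_id])

    def rollback(self) -> str:
        self.history.pop()               # discard the latest commit
        return self.history[-1]

if __name__ == "__main__":
    repo = DataRepo()
    v1 = repo.commit([{"id": 1, "qty": 5}])
    repo.commit([{"id": 1, "qty": 0}])   # a destructive test run changed the data
    restored = repo.checkout(repo.rollback())
    assert restored == [{"id": 1, "qty": 5}]
    print("restored snapshot", v1)
```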

“Setting up and tearing down databases for developers has been the bane of the dev workflow. Not only do developers have to decide WHERE and HOW to run the database but they have to struggle with the configuration,” Robert Reeves, CTO of Datical, said in a post. “Of course, containers are perfect for local development, but until Titan, applying the dev workflow to the data just didn’t happen.”

The problem with data | SD Times, September 24, 2019 | https://sdtimes.com/data/the-problem-with-data/

As any business leader will tell you, data is the lifeblood of organizations operating in the 21st century. A company’s ability to effectively gather and use data can make all the difference in its success. But a number of factors can compromise data’s health, making it unmanageable and therefore unusable for today’s businesses. Specifically, data professionals face a dramatic increase in data complexity, variety and scale.

Here, we explain the three categories that keep data professionals awake at night, and why traditional data management practices and methods won’t help. 

The three factors derailing your effective data use
Data sprawl, data drift and data urgency conspire against all data professionals. By definition, data sprawl is the dramatic variety of data sources and their volume. Consider systems such as mobile interactions, sensor logs and web clickstreams. The data that those systems create changes constantly as the owners adopt updates or re-platform their systems. Modern enterprises experience new data constantly in different formats, from various technologies and new locations.

RELATED CONTENT: Is DataOps the next big thing?

Data drift is the unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data. It is the impact of an increased rate of change across an increasingly complex data architecture. Three forms of data drift exist: structural, semantic and infrastructure. Structural drift occurs when the data schema changes at the source, such as application or database fields being added, deleted, re-ordered or the data type changed. Semantic drift occurs when the meaning of the data changes, even if the structure hasn’t. Consider the evolution from IPv4 to IPv6. This is a common occurrence for applications that are producing log data for analysis of customer behavior, personalization recommendations, and so on. Infrastructure drift occurs when changes to the underlying software or systems create incompatibilities. This includes moving in-house applications to the cloud or moving mainframe apps to client-server systems.
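Structural drift, at least, is straightforward to detect mechanically. The sketch below compares an incoming record against the schema a pipeline was built for and reports added, removed or retyped fields; the field names and types are hypothetical, and semantic and infrastructure drift would need different signals:

```python
# Sketch of detecting structural drift: compare incoming records against the
# schema the pipeline was built for and report added, removed, or retyped
# fields. Field names and types are hypothetical.
EXPECTED_SCHEMA = {"event_id": str, "user_id": str, "amount": float}

def detect_structural_drift(record: dict) -> list[str]:
    findings = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            findings.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            findings.append(
                f"type change: {field} is {type(record[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    for field in record.keys() - EXPECTED_SCHEMA.keys():
        findings.append(f"new field: {field}")
    return findings

if __name__ == "__main__":
    incoming = {"event_id": "e1", "user_id": 42, "amount": 9.99, "channel": "web"}
    for finding in detect_structural_drift(incoming):
        print(finding)   # user_id retyped to int, plus a new "channel" field
```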

Data urgency is the third factor. It’s the compression of analytics timeframes as data is used to make real-time operational decisions. Examples include Uber ride monitoring and fraud detection for financial services. IoT is also creating ever-increasing sources of transactions that need immediate attention: For example, doctors are demanding input from medical sensors connected to their patients.

Anatomy of past data service incident resolutions
You might be thinking “The issues of data sprawl, drift and urgency aren’t new and have been around for years,” and you would be correct in your assessment. But their increased frequency and magnitude impose new requirements. In the past, these issues were generally isolated and could be dealt with using standard exception-handling methods. Let’s look at how data service incidents were resolved in the past (and how they are still resolved in many enterprises).

First, an exception event occurs. It may be flagged by a computer mainline job that ends with an error and is noticed by a data center operator, or a business owner may see odd results in the monthly sales performance report, or a customer calls the service desk to complain about a slow website. In any event, someone in the incident management team or help desk is notified of the exception.

Second, the help desk gathers as much information as they can and makes an assessment of the severity level. For a low severity, they send an email to the application owner and ask them to look into it when they can. If it’s a high severity, they take a more dramatic action and initiate the “Severity 1 Group Page,” which notifies dozens of staff to organize a conference call.

Third, the staff on the conference call works to understand the current issue and its impact, analyzes the problem and determines the root cause, and figures out how to correct the situation and return to normal operations. Dozens of staff are involved because it’s not clear up front what the precise problem or correction is, so anyone that might be able to help is required to attend. The incident recovery often does not result in a permanent solution, and the company needs to know the root cause and how to avoid future occurrences.

Fourth, a postmortem process is initiated to fully understand the root cause and how to avoid it in the future. It could take several weeks to understand what happened, followed by a group review meeting by multiple SMEs and managers, and then a formal report and recommendations for division leaders, internal audit or senior management. Hopefully, the defined recommendations are approved and a permanent resolution is implemented.

Clearly, this four-step process is tedious and expensive, and simply won’t work in today’s reality of increasing data complexity, data variety and data scale. A better approach is required — one that is built on the assumption that data sprawl, data drift and data urgency are the new normal. 

DataOps: A new approach for the new normal
Built on the use of DevOps, DataOps is a fundamental change in the basic concepts and practices of data delivery, and completely challenges the usual and accepted way of integrating data. DataOps expedites the on-boarding of new and uncharted data and the flow of that data to effective operations within an enterprise and its partners, customers and stakeholders, all the while preventing data loss and security threats. Unlike traditional point solutions, DataOps uses “smart” capabilities of automation and monitoring — specifically as monitoring relates to data in motion, including capturing operational events, timing and volume, generating reports and statistics that provide global visibility of the entire and interconnected system, and notifying operators of significant events, errors or deviations from the norm. Monitoring is especially important now because the data landscape is more fluid and continues to evolve dynamically.
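A bare-bones version of that kind of data-in-motion monitoring might look like the sketch below, which wraps each pipeline step to record timing and row volume and flags runs that deviate sharply from recent history; the step name and the 50% deviation threshold are assumptions:

```python
# Sketch of monitoring data in motion: wrap each pipeline step to record timing
# and row volume, and flag runs that deviate sharply from the recent norm.
# Step names and the 50% deviation threshold are assumptions.
import statistics
import time

RUN_HISTORY: dict[str, list[int]] = {}   # step name -> recent row counts

def monitored(step_name: str):
    def decorator(func):
        def wrapper(rows):
            start = time.perf_counter()
            result = func(rows)
            elapsed = time.perf_counter() - start
            history = RUN_HISTORY.setdefault(step_name, [])
            if history:
                typical = statistics.mean(history)
                if typical and abs(len(result) - typical) / typical > 0.5:
                    print(f"ALERT {step_name}: {len(result)} rows vs typical {typical:.0f}")
            history.append(len(result))
            print(f"{step_name}: {len(result)} rows in {elapsed:.3f}s")
            return result
        return wrapper
    return decorator

@monitored("filter_active")
def filter_active(rows):
    return [r for r in rows if r.get("active")]

if __name__ == "__main__":
    filter_active([{"active": True}] * 100)   # establishes the baseline
    filter_active([{"active": True}] * 10)    # triggers a deviation alert
```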

The nature of data and its never-ending creation demands a new approach to its management. No longer can businesses afford the time and resources to tackle data issues. Rather, DataOps presents a new approach that addresses the complexities of the new normal in data management.

A guide to DataOps tools | SD Times, April 3, 2019 | https://sdtimes.com/data/a-guide-to-dataops-tools/

Ascend empowers everyone to create smarter products. Ascend provides a fully managed platform for data analysts, data scientists, and analytics/BI engineers to create Autonomous Data Pipelines that fuel analytics and machine learning applications. Leveraging the platform, these teams can collaborate and adopt DataOps best practices as they self-serve and iterate with data and create reusable, self-healing pipelines on massive data sets in hours instead of weeks or months.

Attunity enables organizations to gain more value from their data while also saving time and money. Its software portfolio accelerates data delivery and availability, automates data readiness, and intelligently optimizes data management.

Composable Analytics is an enterprise-grade DataOps platform that is designed for business users wishing to create data intelligence solutions and data-driven products.

DataKitchen’s DataOps platform provides users with previously unavailable insights by allowing for the development and deployment of innovative and iterative data analytic pipelines.

Delphix offers a dynamic data platform that connects data with the people who need it most. It reduces data friction by providing a collaborative platform for data operators and consumers. This ensures that sensitive data is secured and the right data is made available to the right people.

The Devo Data Operations Platform is a full-stack, multi-tenant, distributed data analytics platform that scales to petabyte data volumes and collects, stores, and analyzes real-time and historical data. Devo collects terabytes of data per day, enabling enterprises to leverage machine data from IT, operational and security sources. Devo reduces direct operational costs and resources while ensuring visibility across the enterprise’s data landscape, delivering performance up to 50x faster than competing solutions using 75% less infrastructure.

HPCC Systems: the big data platform that enables you to spend less time formatting data and more time analyzing it. This truly open source solution allows you to quickly process, analyze, and understand large data sets, even data stored in massive, mixed schema data lakes. Designed by data scientists, HPCC Systems is a complete, integrated solution from data ingestion and data processing to data delivery. Connectivity modules and third-party tools, a Machine Learning Library, and a robust developer community help you get up and running quickly.

Infoworks’ platform automates the operationalization and governance of end-to-end data engineering and DataOps processes. It also provides role-based access controls so that administrators can control which users have access to certain data sets.

Kinaesis are a leading financial services data consultancy focusing on Data Strategy and Execution through their DataOps methodology. They provide DataOps accelerators and consultancy and partner with leading technology vendors to maximise ROI. They aid clients in delivering a data culture whilst helping them to define their strategic data architecture, building pervasive data management and governance capabilities as opposed to ‘one-off’ fixes. Kinaesis are founders of the DataOps Thinktank community on LinkedIn and Twitter.

Lenses.io is a DataOps platform for streaming technologies like Apache Kafka. Lenses enables a seamless experience for running your data platform on-prem, in the cloud or hybrid, and puts DataOps at the heart of your business operations. It provides self-service data-in-motion control, letting you build and monitor your data flows whilst security, data governance and data ethics are treated as first-class citizens. As a streaming platform overlay technology, Lenses® integrates with Kubernetes and can run with any distribution of Apache Kafka including AWS MKS and Azure HDInsight.

MapR is a data platform that combines AI and analytics. Its DataOps Governance Framework offers a blend of technology options that provide an enterprise-wide management solution to help organizations govern their data.

Nexla is a data platform that is hoping to be “the new standard in Data Operations.” It offers data ingestion and integration at scale, Flex API technology, the ability to connect to almost any format, the ability to create inter-company feeds, and delivery of your data the way you want it.

Qubole is a cloud-native data platform for self-service AI, machine learning, and analytics. It provides end-to-end big data processing that will enable users to more efficiently conduct ETL, analytics, and AI/ML workloads.

Redgate Software: The increasing desire to include database development in DevOps practices like continuous integration and continuous delivery has to be balanced against the need to keep data safe. Hence the rise in database management tools that help introduce compliance by default, yet also speed up development while protecting personal data. Redgate’s portfolio of SQL Server tools spans the whole database development process, from version control to data masking, and also plugs into the same infrastructure already used for application development, so the database can be developed alongside the application.
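
The kind of workflow this entry alludes to — database changes versioned alongside application code and applied by the same CI pipeline — can be illustrated without any particular vendor’s tooling. The sketch below is a minimal, hypothetical migration runner (not Redgate’s product) that uses SQLite purely for illustration; the directory layout, file names and ledger table are all assumptions.

```python
import sqlite3
from pathlib import Path

def apply_migrations(db_path: str, migrations_dir: str) -> None:
    """Apply any .sql migration scripts that haven't run yet, in name order."""
    conn = sqlite3.connect(db_path)
    # Ledger of applied changes lives in the database itself.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}

    for script in sorted(Path(migrations_dir).glob("*.sql")):
        if script.name in applied:
            continue  # already applied on a previous build
        conn.executescript(script.read_text())  # run the versioned change
        conn.execute("INSERT INTO schema_migrations VALUES (?)", (script.name,))
        conn.commit()
        print(f"applied {script.name}")

if __name__ == "__main__":
    # e.g. migrations/001_create_customers.sql, migrations/002_add_masked_email.sql
    apply_migrations("app.db", "migrations")
```

Because each script is recorded exactly once in the ledger, the same step can run safely on every deployment, which is what lets the database travel through the CI/CD infrastructure already used for the application.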

StreamSets is a data integration engine for flowing data from streaming sources to modern analytics platforms. It offers collaborative pipeline design and the ability to deploy and scale at the edge, on-prem, or in the cloud; map and monitor dataflows for end-to-end visibility; and enforce data SLAs.

Tamr offers a new approach to data integration. Its solutions make it easy to use machine learning to unify data silos.

 

The post A guide to DataOps tools appeared first on SD Times.

Is DataOps the next big thing? https://sdtimes.com/data/is-dataops-the-next-big-thing/ Wed, 03 Apr 2019 13:00:21 +0000 https://sdtimes.com/?p=34827 After watching application teams, security teams and operations teams get the -Ops treatment, data engineering teams are now getting their own process ending in -Ops. While still in its very early days, data engineers are beginning to embrace DataOps practices. Gartner defines DataOps as “a collaborative data manager practice, really focused on improving communication, integration, … continue reading

The post Is DataOps the next big thing? appeared first on SD Times.

After watching application teams, security teams and operations teams get the -Ops treatment, data engineering teams are now getting their own process ending in -Ops.

While DataOps is still in its very early days, data engineers are beginning to embrace its practices.

Gartner defines DataOps as “a collaborative data management practice, really focused on improving communication, integration, and automation of data flow between managers and consumers of data within an organization,” explained Nick Heudecker, an analyst at Gartner and lead author of Gartner’s Innovation Insight piece on DataOps.

DataOps is first and foremost a people-driven practice, rather than a technology-oriented one. “You cannot buy your way into DataOps,” Heudecker said.

Michele Goetz, a principal analyst at the research firm Forrester, explained that DataOps is like the DevOps version of anything to do with data engineering. “Anything that requires somebody with data expertise from a technical perspective falls into this DataOps category,” she said. “Why we say it’s like a facet of DevOps is because it operates under the same model as continuous development using agile methods, just like DevOps does.”

DataOps aims to eliminate some of the problems caused by miscommunication between developers and stakeholders. Often, when someone in an organization requests a new data set or new report, there is a lack of communication between the person requesting it and whoever will follow through on that request. For example, someone may make a request, an engineer will deliver what they believe is needed, and when the requester receives it, they are disappointed that it’s not what they asked for, Heudecker explained. This can result in increased frustration and missed deadlines.

By getting stakeholders involved throughout the process, some of those headaches may be avoided. “[CIOs] really want to figure out how do they get less friction in their companies around data, which everybody’s asking for today,” said Heudecker.

Another potential benefit of DataOps is improved data utilization, Heudecker explained. These are some of the questions he said organizations may start to ask themselves (a small sketch of the kind of automated check behind several of them follows the list):

  • “Can I use the data that’s coming into my organization faster?
  • Are things less brittle?
  • Can things be more reliable?
  • Can I react to changes in data schemas faster?
  • Is there a better understanding of what data represents and what data means?
  • Can I get faster time to market for the data assets I have?
  • Can I govern things more adequately within my company because there’s a better understanding of what that data actually represents?”
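
Several of these questions — brittleness, reliability, reacting to schema changes — usually come down to automated checks that run every time data moves. The snippet below is a minimal, hypothetical sketch of such a check; the feed name, expected columns, and rules are invented for illustration and assume a plain pandas DataFrame.

```python
import pandas as pd

# Hypothetical contract for an incoming "orders" feed.
EXPECTED_COLUMNS = {"order_id": "int64", "customer_id": "int64", "amount": "float64"}

def check_orders(df: pd.DataFrame) -> list:
    """Return a list of problems found in the feed; an empty list means it passed."""
    problems = []
    # Schema drift: a renamed or re-typed column fails fast instead of breaking reports downstream.
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col} is {df[col].dtype}, expected {dtype}")
    # Basic data-quality rules: no empty loads, no negative amounts.
    if df.empty:
        problems.append("feed is empty")
    elif "amount" in df.columns and (df["amount"] < 0).any():
        problems.append("negative amounts present")
    return problems

if __name__ == "__main__":
    sample = pd.DataFrame({"order_id": [1], "customer_id": [42], "amount": [19.99]})
    print(check_orders(sample) or "all checks passed")
```

Failing the pipeline on the first reported problem, rather than letting a bad load reach consumers, is the mechanism behind most of the reliability questions above.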

According to Goetz, for companies that have been journeying down the path of “tightening the bolts” of what is needed from a data perspective and how that supports digital and other advanced analytics strategies, it is clear that they need an operating model that allows development around data to fit into their existing solution development track. This enables them to have data experts on the same team as the rest of the DevOps Scrum teams, she explained.

Organizations that are less mature in their data operations tend to still think in terms of executing on data from a data architecture perspective. In addition, a lot of those less mature companies do not handle data in-house, but will outsource it to systems integrators and will take a project-oriented waterfall approach, Goetz explained.

The companies that are already getting DataOps right are typically going to be the ones that already have a DevOps practice in place for their solution development, whether it’s on the application or automation side, Goetz explained. Those more advanced companies also tend to have a model for portfolio management and business architecture that aligns to continuous development. “They’re recognizing there is an opportunity to better fit into the way that you operate around development with those teams so that data doesn’t get left behind and isn’t building up technical debt,” she said.

According to Goetz, this doesn’t just apply to data systems; it encompasses data governance, which traditionally has been the “final bastion of anything anyone wanted to do with the data. It was always playing cleanup,” she said.

“It’s really fascinating to see how organizations act when the lightbulb goes off and they make the equivalency between DataOps and DevOps,” said Goetz. “It’s like all those barriers start to fall away because they typically have something that’s been in place that they’re able to now fit into instead of fight against.”

Having a DevOps structure in place can ensure DataOps success
According to Goetz, companies that have not at least gone through or adopted some Agile methodologies will have a hard time adopting DataOps.

Goetz explained that over the years, she has seen companies evolve and try to switch from waterfall to Agile, and they tend to struggle and make mistakes along the way, at least at first. Unless a company has some of those Agile competencies, it will likely have a hard time with DataOps as well. “So I think there’s definitely some foundations that make it easier to get started in one end of the company,” said Goetz.

DataOps is probably here to stay, though it will be a while before it is widely adopted
DataOps is still in the very early stages, so it’s hard to predict where it will go in the future, or even if it will reach wide adoption or fizzle out, Heudecker explained. However, even if DataOps isn’t here to stay, it will still have some positive lasting effects, Heudecker said. “If it gets companies thinking differently about how they collaborate around data, that’s a good thing,” said Heudecker. “Even if it is a short-term hype and then it kind of fizzles out after a while, companies internalize some of the principles or ideas around the topic, and that’s good.”

Goetz doesn’t see DataOps going away anytime soon. In fact, she said that it is actually accelerating in terms of interest and adoption.  The level of interest will vary from company to company, but the groundswell is definitely there, she explained.

In fact, a 2018 survey from data company Nexla and research firm Pulse Q&A revealed that 73 percent of organizations were investing in DataOps last year.

The reason she doesn’t see it going away is that one of the catalysts for DataOps is the recognition that organizations no longer just need to build technical capabilities and install applications. In today’s world, organizations are building their own digital foundations, products, and digital businesses, and according to Goetz, those things require a different way of developing and going to market.

“[Those companies] looked at where DevOps came from,” Goetz said. “It came from the product companies, particularly the technology product companies. And they have been successful. And you also see integrators redesigning their development practices around DevOps. So there’s just so much momentum behind it. And there’s better results coming out of these practices in general that I don’t see it going away.”

It may be too early to make any predictions around DataOps
Even though it’s too early to see any obvious trends, Heudecker has seen a lot of interest in the topic. Right now it is very vendor-led, he said, but there has been a lot of interest from organizations, too. In particular, companies are interested in learning exactly what DataOps is and whether or not it will benefit them.

Going forward, it will probably be the organizations themselves, not vendors, who will define the best practices, Heudecker explained.

Organizations trying DataOps out are going to be “leading on what those best practices are and how you create a center of excellence around that,” said Goetz.

One trend that Goetz has already seen is that companies are approaching DataOps from the AI side of things. Algorithms have advanced, and many existing AI models have gotten quite good at classifying, categorizing, and doing other data preparation work, while practitioners have gotten good at finding analytics functions and machine learning models to run against their data. Those practitioners don’t necessarily have to be data scientists, because they don’t have to manipulate the model to optimize it; things are a bit more premade, and vendor tooling is enabling the citizen data scientist. “You don’t always need to have data science skills to take advantage of a data science model or machine learning model,” Goetz explained.
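
To make that point concrete, the sketch below uses an off-the-shelf scikit-learn pipeline with default settings to categorize short text records during data preparation — no feature engineering and no model tuning. The categories and example strings are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of labeled records, e.g. routing support tickets while preparing data.
texts = [
    "invoice total does not match the purchase order",
    "password reset link never arrives",
    "shipment arrived with two items missing",
    "cannot log in after enabling two-factor auth",
    "charged twice for the same subscription",
    "tracking number shows no movement for a week",
]
labels = ["billing", "account", "shipping", "account", "billing", "shipping"]

# Premade components with default settings: vectorize the text, then classify it.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["refund for a duplicate charge", "parcel stuck at the depot"]))
```

Whether the predictions are good enough still depends on the data, but nothing in this workflow requires manipulating the model itself, which is the sense in which premade tooling enables the citizen data scientist.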

Another trend she has seen is that the role of architects will likely change in DataOps structures. Architects have historically been ignored because developers don’t want someone telling them what to develop; they just want to sit down and make it. Often, architects are seen as something that will slow teams down and push them into more of a waterfall structure.

But according to Goetz, in stronger Agile practices, architecture actually plays a significant role because it helps define the vision and patterns.

The role of data governance
Many of the regulations that are popping up around governance, such as Europe’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act, make handling and governing information mandatory requirements for whatever you are going to develop, Goetz explained.

As a result of these new regulations, we are going to start to see that privacy and security from a governance perspective aren’t just going to be handled at the CISO level or in data governance teams. These regulations are creating a stronger working relationship between those stewardship teams and data engineering teams, she said.

“It is required to infuse governance capabilities into every aspect of data development or data design,” said Goetz. “That can’t be lost… there’s a symbiotic relationship that is developing, in DataOps specifically, where what you do from a data management and architecture perspective, what you do from a delivery perspective, and what you do for a governance perspective, those are no longer three different silos. It is one single organization, and if there’s only one benefit to going down the route of adopting DataOps, it is that you have a better operating model for data in general, regardless. You will build a better data lake. You will build better pipelines. You will build more secure environments. You will tune your data to business needs better, just by that symbiotic relationship. And I think that that’s the accelerator to not failing in your digital capabilities when data is at the core.”
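
One concrete reading of “infusing governance into every aspect of data development” is that privacy rules become ordinary, testable pipeline steps rather than a separate review at the end. The sketch below is minimal and hypothetical: it assumes a pandas DataFrame and an invented policy listing which columns are dropped and which are pseudonymized.

```python
import hashlib
import pandas as pd

# Hypothetical governance policy: these columns never leave the pipeline in the clear.
DROP_COLUMNS = ["ssn"]
HASH_COLUMNS = ["email"]

def apply_privacy_policy(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the data with sensitive fields removed or pseudonymized."""
    out = df.drop(columns=[c for c in DROP_COLUMNS if c in df.columns])
    for col in HASH_COLUMNS:
        if col in out.columns:
            # A one-way hash keeps the column joinable without exposing the raw value.
            out[col] = out[col].astype(str).map(
                lambda v: hashlib.sha256(v.encode()).hexdigest()
            )
    return out

if __name__ == "__main__":
    raw = pd.DataFrame({"email": ["a@example.com"], "ssn": ["000-00-0000"], "spend": [120.0]})
    print(apply_privacy_policy(raw))
```

Because the policy runs on every execution of the pipeline, the stewardship team’s rules and the engineering team’s delivery become the same artifact, which is the collapse of silos the quote describes.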

The DataOps Manifesto
Though it is still in its early days, DataOps already has its own manifesto, similar to the Agile Manifesto.

The DataOps Manifesto places value in:

  • “Individuals and interactions over processes and tools
  • Working analytics over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Experimentation, iteration, and feedback over extensive upfront design
  • Cross-functional ownership of operations over siloed responsibilities”

Other principles of DataOps that it lists include continually satisfying customers, valuing working analytics, embracing change, having daily interactions, self-organizing, and more.

 

 

The post Is DataOps the next big thing? appeared first on SD Times.
